diff mbox series

[v15,22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum

Message ID 9e80873fac878aa5d697cbcd4d456d01e1009d1f.1699527082.git.kai.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX host kernel support | expand

Commit Message

Huang, Kai Nov. 9, 2023, 11:55 a.m. UTC
The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.

== Background ==

Virtually all kernel memory accesses happen in full cachelines.  In
practice, writing a "byte" of memory usually reads a 64-byte cacheline
of memory, modifies it, then writes the whole line back.  Those
operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
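
A minimal sketch of such a partial write (illustration only, not from
this patch):

/*
 * MOVNTI performs a non-temporal store that bypasses the cache and
 * lands at the memory controller as a sub-cacheline ("partial")
 * write.  On an affected CPU, aiming this at TDX private memory
 * silently poisons the containing cacheline.
 */
static void partial_write(u32 *p, u32 val)
{
	asm volatile("movnti %1, %0" : "=m" (*p) : "r" (val));
	asm volatile("sfence" ::: "memory");	/* order the NT store */
}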

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present these
as a literal "Hardware error" when they were, in fact, a
software-triggered issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not; it just dumps, for instance, the message below:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

The log says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it would be better for the log to say "software bug around TDX
private memory" instead of "Hardware Error".  But in reality real
hardware memory errors can happen too, and sadly such a software-triggered
#MC cannot be distinguished from a real hardware error.  Also, the error
message is parsed by the userspace tool 'mcelog', so changing the output
may break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message, making the log look like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determining whether the memory
page is TDX private memory.  There is no existing infrastructure to do
that, so add an interface to query the TDX module to fill this gap.
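
Condensed from the hunks in this patch (the full code appears in the
review below), the resulting message selection looks roughly like:

static const char *mce_memory_info(struct mce *m)
{
	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
		return NULL;

	/* Only CPUs with the partial-write erratum are affected */
	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
		return NULL;

	/* Ask the TDX module whether the page is TDX private memory */
	return !tdx_is_private_mem(m->addr) ? NULL :
		"TDX private memory error. Possible kernel bug.";
}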

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs: first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption detection techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.
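
For illustration, a hypothetical sketch of that case.  Nothing is
corrupted, so slab redzones stay intact, yet on an affected CPU every
store below is a poisoning partial write:

static void nt_rewrite_in_place(u64 *buf, size_t qwords)
{
	size_t i;

	/* Rewrite the buffer with its own contents via NT stores */
	for (i = 0; i < qwords; i++)
		asm volatile("movnti %1, %0" : "=m" (buf[i]) : "r" (buf[i]));
	asm volatile("sfence" ::: "memory");
}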

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
---

v14 -> v15:
 - No change

v13 -> v14:
 - No change

v12 -> v13:
 - Added Kirill and Yuan's tag.

v11 -> v12:
 - Simplified #MC message (Dave/Kirill)
 - Slightly improved some comments.

v10 -> v11:
 - New patch


---
 arch/x86/include/asm/tdx.h     |   2 +
 arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++
 arch/x86/virt/vmx/tdx/tdx.c    | 103 +++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    |   5 ++
 4 files changed, 143 insertions(+)

Comments

Tony Luck Nov. 30, 2023, 6:01 p.m. UTC | #1
On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> Instead of modifying the above error log, improve it by printing an
> additional TDX-related message, making the log look like:
> 
>   ...
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

This seems a reasonable addition.

>  arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++

Reviewed-by: Tony Luck <tony.luck@intel.com>

[I only reviewed the hooks into mce/core.c.  I don't feel qualified
to dig through the TDX bits that determine this is a TD private page]

-Tony
Dave Hansen Dec. 1, 2023, 8:35 p.m. UTC | #2
On 11/9/23 03:55, Kai Huang wrote:
> +static bool is_pamt_page(unsigned long phys)
> +{
> +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> +	int i;
> +
> +	/*
> +	 * This function is called from #MC handler, and theoretically
> +	 * it could run in parallel with the TDX module initialization
> +	 * on other logical cpus.  But it's not OK to hold mutex here
> +	 * so just blindly check module status to make sure PAMTs/TDMRs
> +	 * are stable to access.
> +	 *
> +	 * This may return inaccurate result in rare cases, e.g., when
> +	 * #MC happens on a PAMT page during module initialization, but
> +	 * this is fine as #MC handler doesn't need a 100% accurate
> +	 * result.
> +	 */

It doesn't need perfect accuracy.  But how do we know it's not going to
go, for instance, chase a bad pointer?

> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return false;

As an example, what prevents this CPU from observing
tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
being assembled?

> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		unsigned long base, size;
> +
> +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> +
> +		if (phys >= base && phys < (base + size))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Return whether the memory page at the given physical address is TDX
> + * private memory or not.  Called from #MC handler do_machine_check().
> + *
> + * Note this function may not return an accurate result in rare cases.
> + * This is fine as the #MC handler doesn't need a 100% accurate result,
> + * because it cannot distinguish #MC between software bug and real
> + * hardware error anyway.
> + */
> +bool tdx_is_private_mem(unsigned long phys)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = phys & PAGE_MASK,
> +	};
> +	u64 sret;
> +
> +	if (!platform_tdx_enabled())
> +		return false;
> +
> +	/* Get page type from the TDX module */
> +	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> +	/*
> +	 * Handle the case that CPU isn't in VMX operation.
> +	 *
> +	 * KVM guarantees no VM is running (thus no TDX guest)
> +	 * when there's any online CPU isn't in VMX operation.
> +	 * This means there will be no TDX guest private memory
> +	 * and Secure-EPT pages.  However the TDX module may have
> +	 * been initialized and the memory page could be PAMT.
> +	 */
> +	if (sret == TDX_SEAMCALL_UD)
> +		return is_pamt_page(phys);

Either this comment is wonky or the module initialization is buggy.

config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
zero checks or special handling in there for whether the CPU has done
VMXON.  So, by the time we've started initializing the TDX module
(including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?

So how can we have a working PAMT here when this CPU can't do SEAMCALLs?

I don't think we should even bother with this complexity.  I think we
can just fix the whole thing by saying that unless you can make a
non-init SEAMCALL, we just assume the memory can't be private.

The transition to being able to make non-init SEAMCALLs is also #MC safe
*and* it's at a point when the tdmr_list is stable.

Can anyone shoot any holes in that? :)
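
A minimal sketch of that simplification, reusing names from the patch
(the page-type decode helper at the end is hypothetical):

bool tdx_is_private_mem(unsigned long phys)
{
	struct tdx_module_args args = {
		.rcx = phys & PAGE_MASK,
	};

	if (!platform_tdx_enabled())
		return false;

	/*
	 * Treat any SEAMCALL failure, including TDX_SEAMCALL_UD when
	 * the CPU isn't in VMX operation, as "not private".  No PAMT
	 * special case, no is_pamt_page().
	 */
	if (__seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args))
		return false;

	return tdx_page_type_is_private(&args);	/* hypothetical decode */
}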
Huang, Kai Dec. 3, 2023, 11:44 a.m. UTC | #3
On Fri, 2023-12-01 at 12:35 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> > +static bool is_pamt_page(unsigned long phys)
> > +{
> > +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> > +	int i;
> > +
> > +	/*
> > +	 * This function is called from #MC handler, and theoretically
> > +	 * it could run in parallel with the TDX module initialization
> > +	 * on other logical cpus.  But it's not OK to hold mutex here
> > +	 * so just blindly check module status to make sure PAMTs/TDMRs
> > +	 * are stable to access.
> > +	 *
> > +	 * This may return inaccurate result in rare cases, e.g., when
> > +	 * #MC happens on a PAMT page during module initialization, but
> > +	 * this is fine as #MC handler doesn't need a 100% accurate
> > +	 * result.
> > +	 */
> 
> It doesn't need perfect accuracy.  But how do we know it's not going to
> go, for instance, chase a bad pointer?
> 
> > +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> > +		return false;
> 
> As an example, what prevents this CPU from observing
> tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
> being assembled?

There are two types of memory-order serializing operations between
assembling the TDMR/PAMT structure and setting tdx_module_status to
TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.

WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX
module, which involves GDT/LDT/control register/MSR switches, so it is
also a serializing operation.

But perhaps we can explicitly add an smp_wmb() between assembling the
TDMR/PAMT structure and setting tdx_module_status, if that's better.
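
Something like this sketch (the build step is elided; the names
otherwise follow the patch):

/* Initialization side, at the end of module initialization: */
/* ... build TDMRs and allocate/initialize PAMTs ... */
smp_wmb();				/* publish tdx_tdmr_list ... */
tdx_module_status = TDX_MODULE_INITIALIZED;

/* #MC side, at the top of is_pamt_page(): */
if (tdx_module_status != TDX_MODULE_INITIALIZED)
	return false;
smp_rmb();				/* ... before the status is read */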

> 
> > +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> > +		unsigned long base, size;
> > +
> > +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> > +
> > +		if (phys >= base && phys < (base + size))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * Return whether the memory page at the given physical address is TDX
> > + * private memory or not.  Called from #MC handler do_machine_check().
> > + *
> > + * Note this function may not return an accurate result in rare cases.
> > + * This is fine as the #MC handler doesn't need a 100% accurate result,
> > + * because it cannot distinguish #MC between software bug and real
> > + * hardware error anyway.
> > + */
> > +bool tdx_is_private_mem(unsigned long phys)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = phys & PAGE_MASK,
> > +	};
> > +	u64 sret;
> > +
> > +	if (!platform_tdx_enabled())
> > +		return false;
> > +
> > +	/* Get page type from the TDX module */
> > +	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> > +	/*
> > +	 * Handle the case that CPU isn't in VMX operation.
> > +	 *
> > +	 * KVM guarantees no VM is running (thus no TDX guest)
> > +	 * when there's any online CPU isn't in VMX operation.
> > +	 * This means there will be no TDX guest private memory
> > +	 * and Secure-EPT pages.  However the TDX module may have
> > +	 * been initialized and the memory page could be PAMT.
> > +	 */
> > +	if (sret == TDX_SEAMCALL_UD)
> > +		return is_pamt_page(phys);
> 
> Either this comment is wonky or the module initialization is buggy.
> 
> config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
> zero checks or special handling in there for whether the CPU has done
> VMXON.  So, by the time we've started initializing the TDX module
> (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
> 
> So how can we have a working PAMT here when this CPU can't do SEAMCALLs?

The corner case is that KVM can enable VMX on all cpus, initialize the TDX
module, and then disable VMX on all cpus.  One example is that KVM can be
unloaded after it initializes the TDX module.

In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)

However, if a SEAMCALL cannot be made (because the CPU is out of VMX
operation), then the module initialization has either fully completed or
never been attempted, so both tdx_module_status and the tdx_tdmr_list are
stable to access.

> 
> I don't think we should even bother with this complexity.  I think we
> can just fix the whole thing by saying that unless you can make a
> non-init SEAMCALL, we just assume the memory can't be private.
> 
> The transition to being able to make non-init SEAMCALLs is also #MC safe
> *and* it's at a point when the tdmr_list is stable.
> 
> Can anyone shoot any holes in that? :)
Dave Hansen Dec. 4, 2023, 5:07 p.m. UTC | #4
On 12/3/23 03:44, Huang, Kai wrote:
...
>> It doesn't need perfect accuracy.  But how do we know it's not going to
>> go, for instance, chase a bad pointer?
>>
>>> +   if (tdx_module_status != TDX_MODULE_INITIALIZED)
>>> +           return false;
>>
>> As an example, what prevents this CPU from observing
>> tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
>> being assembled?
> 
> There are two types of memory order serializing operations between assembling
> the TDMR/PAMT structure and setting the tdx_module_status to
> TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.
> 
> WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX module,
> which involves GDT/LDT/control registers/MSRs switch so it is also a serializing
> operation.
> 
> But perhaps we can explicitly add a smp_wmb() between assembling TDMR/PAMT
> structure and setting tdx_module_status if that's better.

... and there's zero documentation of this dependency because ... ?

I suspect it's because it was never looked at until Tony made a comment
about it and we started looking at it.  In other words, it worked by
coincidence.

>>> +   for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
>>> +           unsigned long base, size;
>>> +
>>> +           tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
>>> +
>>> +           if (phys >= base && phys < (base + size))
>>> +                   return true;
>>> +   }
>>> +
>>> +   return false;
>>> +}
>>> +
>>> +/*
>>> + * Return whether the memory page at the given physical address is TDX
>>> + * private memory or not.  Called from #MC handler do_machine_check().
>>> + *
>>> + * Note this function may not return an accurate result in rare cases.
>>> + * This is fine as the #MC handler doesn't need a 100% accurate result,
>>> + * because it cannot distinguish #MC between software bug and real
>>> + * hardware error anyway.
>>> + */
>>> +bool tdx_is_private_mem(unsigned long phys)
>>> +{
>>> +   struct tdx_module_args args = {
>>> +           .rcx = phys & PAGE_MASK,
>>> +   };
>>> +   u64 sret;
>>> +
>>> +   if (!platform_tdx_enabled())
>>> +           return false;
>>> +
>>> +   /* Get page type from the TDX module */
>>> +   sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
>>> +   /*
>>> +    * Handle the case that CPU isn't in VMX operation.
>>> +    *
>>> +    * KVM guarantees no VM is running (thus no TDX guest)
>>> +    * when there's any online CPU isn't in VMX operation.
>>> +    * This means there will be no TDX guest private memory
>>> +    * and Secure-EPT pages.  However the TDX module may have
>>> +    * been initialized and the memory page could be PAMT.
>>> +    */
>>> +   if (sret == TDX_SEAMCALL_UD)
>>> +           return is_pamt_page(phys);
>>
>> Either this comment is wonky or the module initialization is buggy.
>>
>> config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
>> zero checks or special handling in there for whether the CPU has done
>> VMXON.  So, by the time we've started initializing the TDX module
>> (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
>>
>> So how can we have a working PAMT here when this CPU can't do SEAMCALLs?
> 
> The corner case is that KVM can enable VMX on all cpus, initialize the TDX
> module, and then disable VMX on all cpus.  One example is that KVM can be
> unloaded after it initializes the TDX module.
> 
> In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)
> 
> However, if a SEAMCALL cannot be made (because the CPU is out of VMX
> operation), then the module initialization has either fully completed or
> never been attempted, so both tdx_module_status and the tdx_tdmr_list are
> stable to access.

None of this even matters.  Let's remind ourselves how unbelievably
unlikely this is:

1. You're on an affected system that has the erratum
2. The KVM module gets unloaded, runs vmxoff
3. A kernel bug using a very rare partial write corrupts the PAMT
4. A second bug reads the PAMT consuming poison, #MC is generated
5. Enter #MC handler, SEAMCALL fails
6. #MC handler just reports a plain hardware error

The only thing even remotely wrong with this situation is that the
report won't pin the #MC on TDX.  Play stupid games (removing modules),
win stupid prizes (worse error message).

Can we dynamically mark a module as unsafe to remove?  If so, I'd
happily just say that we should make kvm_intel.ko unsafe to remove when
TDX is supported and move on with life.
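
There is an existing mechanism that gets close, sketched below: a module
can pin itself by taking a reference it never drops (the call site here
is a guess):

/* Hypothetical: pin kvm_intel.ko once the TDX module is initialized,
 * so module unload (and its VMXOFF) can never happen afterwards. */
if (!tdx_enable())
	__module_get(THIS_MODULE);	/* never dropped: module is pinned */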

tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
removed is a fool's errand.
Huang, Kai Dec. 4, 2023, 9 p.m. UTC | #5
On Mon, 2023-12-04 at 09:07 -0800, Dave Hansen wrote:
> On 12/3/23 03:44, Huang, Kai wrote:
> ...
> > > It doesn't need perfect accuracy.  But how do we know it's not going to
> > > go, for instance, chase a bad pointer?
> > > 
> > > > +   if (tdx_module_status != TDX_MODULE_INITIALIZED)
> > > > +           return false;
> > > 
> > > As an example, what prevents this CPU from observing
> > > tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
> > > being assembled?
> > 
> > There are two types of memory order serializing operations between assembling
> > the TDMR/PAMT structure and setting the tdx_module_status to
> > TDX_MODULE_INITIALIZED: 1) wbinvd_on_all_cpus(); 2) a bunch of SEAMCALLs.
> > 
> > WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX module,
> > which involves GDT/LDT/control registers/MSRs switch so it is also a serializing
> > operation.
> > 
> > But perhaps we can explicitly add a smp_wmb() between assembling TDMR/PAMT
> > structure and setting tdx_module_status if that's better.
> 
> ... and there's zero documentation of this dependency because ... ?
> 
> I suspect it's because it was never looked at until Tony made a comment
> about it and we started looking at it.  In other words, it worked by
> coincidence.

I should have put a comment around here.  My bad.

Kirill also helped to look at the code.

> 
> > > > +   for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> > > > +           unsigned long base, size;
> > > > +
> > > > +           tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> > > > +
> > > > +           if (phys >= base && phys < (base + size))
> > > > +                   return true;
> > > > +   }
> > > > +
> > > > +   return false;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Return whether the memory page at the given physical address is TDX
> > > > + * private memory or not.  Called from #MC handler do_machine_check().
> > > > + *
> > > > + * Note this function may not return an accurate result in rare cases.
> > > > + * This is fine as the #MC handler doesn't need a 100% accurate result,
> > > > + * because it cannot distinguish #MC between software bug and real
> > > > + * hardware error anyway.
> > > > + */
> > > > +bool tdx_is_private_mem(unsigned long phys)
> > > > +{
> > > > +   struct tdx_module_args args = {
> > > > +           .rcx = phys & PAGE_MASK,
> > > > +   };
> > > > +   u64 sret;
> > > > +
> > > > +   if (!platform_tdx_enabled())
> > > > +           return false;
> > > > +
> > > > +   /* Get page type from the TDX module */
> > > > +   sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> > > > +   /*
> > > > +    * Handle the case that CPU isn't in VMX operation.
> > > > +    *
> > > > +    * KVM guarantees no VM is running (thus no TDX guest)
> > > > +    * when there's any online CPU isn't in VMX operation.
> > > > +    * This means there will be no TDX guest private memory
> > > > +    * and Secure-EPT pages.  However the TDX module may have
> > > > +    * been initialized and the memory page could be PAMT.
> > > > +    */
> > > > +   if (sret == TDX_SEAMCALL_UD)
> > > > +           return is_pamt_page(phys);
> > > 
> > > Either this comment is wonky or the module initialization is buggy.
> > > 
> > > config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
> > > zero checks or special handling in there for whether the CPU has done
> > > VMXON.  So, by the time we've started initializing the TDX module
> > > (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
> > > 
> > > So how can we have a working PAMT here when this CPU can't do SEAMCALLs?
> > 
> > The corner case is that KVM can enable VMX on all cpus, initialize the TDX
> > module, and then disable VMX on all cpus.  One example is that KVM can be
> > unloaded after it initializes the TDX module.
> > 
> > In this case the CPU cannot do SEAMCALL but the PAMTs are already working :-)
> > 
> > However, if a SEAMCALL cannot be made (because the CPU is out of VMX
> > operation), then the module initialization has either fully completed or
> > never been attempted, so both tdx_module_status and the tdx_tdmr_list are
> > stable to access.
> 
> None of this even matters.  Let's remind ourselves how unbelievably
> unlikely this is:
> 
> 1. You're on an affected system that has the erratum
> 2. The KVM module gets unloaded, runs vmxoff
> 3. A kernel bug using a very rare partial write corrupts the PAMT
> 4. A second bug reads the PAMT consuming poison, #MC is generated
> 5. Enter #MC handler, SEAMCALL fails
> 6. #MC handler just reports a plain hardware error

Yes, I totally agree it is very unlikely to happen.

> 
> The only thing even remotely wrong with this situation is that the
> report won't pin the #MC on TDX.  Play stupid games (removing modules),
> win stupid prizes (worse error message).
> 
> Can we dynamically mark a module as unsafe to remove?  If so, I'd
> happily just say that we should make kvm_intel.ko unsafe to remove when
> TDX is supported and move on with life.
> 
> tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
> removed is a fool's errand.

Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
"KVM can be unloaded" as an example to explain the PAMT can be working when VMX
is off.
Dave Hansen Dec. 4, 2023, 10:04 p.m. UTC | #6
On 12/4/23 13:00, Huang, Kai wrote:
>> tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
>> removed is a fool's errand.
> Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
> last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
> "KVM can be unloaded" as an example to explain the PAMT can be working when VMX
> is off.

Can't we just fix this by having KVM do an "extra" hardware_enable_all()
before initializing the TDX module?  It's not wrong to say that TDX is a
KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
down the TDX module.  Then there's no PAMT to worry about.

The shutdown would be something like:

	1. TDX module shutdown
	2. Deallocate/Convert PAMT
	3. vmxoff

Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
to be missed.
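
In sketch form (all three helpers below are hypothetical; only the
ordering matters):

static void tdx_shutdown(void)
{
	shutdown_tdx_module();	/* 1: module stops accepting SEAMCALLs */
	free_pamts();		/* 2: PAMT pages return to normal use */
	vmxoff_all_cpus();	/* 3: only now is VMXOFF safe */
}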
Huang, Kai Dec. 4, 2023, 11:24 p.m. UTC | #7
On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> On 12/4/23 13:00, Huang, Kai wrote:
> > > tl;dr: I think even looking at a #MC on the PAMT after the kvm module is
> > > removed is a fool's errand.
> > Sorry I wasn't clear enough.  KVM actually turns off VMX when it destroys the
> > last VM, so the KVM module doesn't need to be removed to turn off VMX.  I used
> > "KVM can be unloaded" as an example to explain the PAMT can be working when VMX
> > is off.
> 
> Can't we just fix this by having KVM do an "extra" hardware_enable_all()
> before initializing the TDX module?  
> 

Yes KVM needs to do hardware_enable_all() anyway before initializing the TDX
module.  

I believe you mean we can keep VMX enabled after initializing the TDX module,
i.e., not calling hardware_disable_all() after that, so that kvm_usage_count
will remain non-zero even when the last VM is destroyed?

The current behaviour, where KVM only enables VMX when there's an active
VM, exists because the kernel wants to allow loading and running a
third-party VMX module (yes, VirtualBox) while the KVM module is loaded.
Only one of them can actually use the VMX hardware, but they can both be
loaded.

In ancient times KVM used to enable VMX immediately when it was loaded, but
later it was changed to only enable VMX when there's an active VM, for the
above reason.

See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

> It's not wrong to say that TDX is a
> > KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
> down the TDX module.  Then there's no PAMT to worry about.
> 
> The shutdown would be something like:
> 
> 	1. TDX module shutdown
> 	2. Deallocate/Convert PAMT
> 	3. vmxoff
> 
> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> to be missed.

The limitation is that once the TDX module is shut down, it cannot be
initialized again unless it is updated at runtime.

In the long term, if we go with this design then there might be other
problems when other kernel components are using TDX.  For example, the VT-d
driver will need to be changed to support TDX-IO, and it will need to enable
the TDX module much earlier than KVM to do some initialization.  It might
need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
super familiar with TDX-IO, but it looks like we might have a problem here
if we go with such a design.
Dave Hansen Dec. 4, 2023, 11:39 p.m. UTC | #8
On 12/4/23 15:24, Huang, Kai wrote:
> On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
...
> In ancient times KVM used to enable VMX immediately when it was loaded, but
> later it was changed to only enable VMX when there's an active VM, for the
> above reason.
> 
> See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

Fine.  This doesn't need to change ... until you load TDX.  Once you
initialize the TDX module, no more out-of-tree VMMs for you.

That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
dynamic TDX module initialization is a good idea.

>> It's not wrong to say that TDX is a
>> KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
>> down the TDX module.  Then there's no PAMT to worry about.
>>
>> The shutdown would be something like:
>>
>>       1. TDX module shutdown
>>       2. Deallocate/Convert PAMT
>>       3. vmxoff
>>
>> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
>> to be missed.
> 
> The limitation is that once the TDX module is shut down, it cannot be
> initialized again unless it is updated at runtime.
> 
> In the long term, if we go with this design then there might be other
> problems when other kernel components are using TDX.  For example, the VT-d
> driver will need to be changed to support TDX-IO, and it will need to enable
> the TDX module much earlier than KVM to do some initialization.  It might
> need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> super familiar with TDX-IO, but it looks like we might have a problem here
> if we go with such a design.

The burden for who does vmxon will simply need to change from KVM itself
to some common code that KVM depends on.  Probably not dissimilar to
those nutty (sorry folks, just calling it as I see 'em) multi-KVM module
patches that are floating around.
Huang, Kai Dec. 4, 2023, 11:41 p.m. UTC | #9
On Mon, 2023-12-04 at 23:24 +0000, Huang, Kai wrote:
> In the long term, if we go with this design then there might be other
> problems when other kernel components are using TDX.  For example, the VT-d
> driver will need to be changed to support TDX-IO, and it will need to enable
> the TDX module much earlier than KVM to do some initialization.  It might
> need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> super familiar with TDX-IO, but it looks like we might have a problem here
> if we go with such a design.

Perhaps I shouldn't use a future feature as an argument; e.g., with multiple
TDX users we are likely to have a refcount to see whether we can truly shut
down TDX.

And VMX on/off will also need to be moved out of KVM for that work.

But the point is it's better not to assume how these kernel components will
use VMX on/off.  E.g., one may just choose to simply turn on VMX, do a
SEAMCALL, and then turn off VMX immediately, while the TDX module stays
alive all the time.

Keeping VMX on will suppress INIT; I guess that's another reason we prefer
turning VMX on only when needed.

/*      
 * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during
 * reboot.  VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
 * GIF=0, i.e. if the crash occurred between CLGI and STGI.
 */
void cpu_emergency_disable_virtualization(void)
{
	...
}
Huang, Kai Dec. 4, 2023, 11:56 p.m. UTC | #10
On Mon, 2023-12-04 at 15:39 -0800, Dave Hansen wrote:
> On 12/4/23 15:24, Huang, Kai wrote:
> > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> ...
> > In ancient times KVM used to enable VMX immediately when it was loaded, but
> > later it was changed to only enable VMX when there's an active VM, for the
> > above reason.
> > 
> > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").
> 
> Fine.  This doesn't need to change ... until you load TDX.  Once you
> initialize the TDX module, no more out-of-tree VMMs for you.
> 
> That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
> dynamic TDX module initialization is a good idea.

I don't have objection to this.

> 
> > > It's not wrong to say that TDX is a
> > > KVM user.  If KVM wants 'kvm_usage_count' to go back to 0, it can shut
> > > down the TDX module.  Then there's no PAMT to worry about.
> > > 
> > > The shutdown would be something like:
> > > 
> > >       1. TDX module shutdown
> > >       2. Deallocate/Convert PAMT
> > >       3. vmxoff
> > > 
> > > Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> > > to be missed.
> > 
> > The limitation is that once the TDX module is shut down, it cannot be
> > initialized again unless it is updated at runtime.
> > 
> > In the long term, if we go with this design then there might be other
> > problems when other kernel components are using TDX.  For example, the VT-d
> > driver will need to be changed to support TDX-IO, and it will need to enable
> > the TDX module much earlier than KVM to do some initialization.  It might
> > need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> > super familiar with TDX-IO, but it looks like we might have a problem here
> > if we go with such a design.
> 
> The burden for who does vmxon will simply need to change from KVM itself
> to some common code that KVM depends on.  Probably not dissimilar to
> those nutty (sorry folks, just calling it as I see 'em) multi-KVM module
> patches that are floating around.
> 

Right, we will need to move VMX on/off out of KVM for that purpose.  I think
the point is it's better not to assume how these kernel components will use
VMX on/off.  E.g., one may just choose to simply turn on VMX, do a SEAMCALL,
and then turn off VMX immediately, while the TDX module stays alive all the
time.

Or we require that they all: 1) enable VMX; 2) enable/use TDX; 3) disable
TDX when no longer needed; 4) disable VMX.

But I don't have a strong opinion here either.
Sean Christopherson Dec. 5, 2023, 2:04 a.m. UTC | #11
On Mon, Dec 04, 2023, Dave Hansen wrote:
> On 12/4/23 15:24, Huang, Kai wrote:
> > On Mon, 2023-12-04 at 14:04 -0800, Hansen, Dave wrote:
> ...
> > In ancient times KVM used to enable VMX immediately when it was loaded, but
> > later it was changed to only enable VMX when there's an active VM, for the
> > above reason.
> > 
> > See commit 10474ae8945ce ("KVM: Activate Virtualization On Demand").

Huh, I always just assumed it was some backwards thinking about enabling VMX/SVM
being "dangerous" or something.

> Fine.  This doesn't need to change ... until you load TDX.  Once you
> initialize the TDX module, no more out-of-tree VMMs for you.

It's not just out-of-tree hypervisors, which IMO should be little more than an
afterthought.  The other, more important issue is that being post-VMXON blocks
INIT, i.e. VMX needs to be disabled before reboot, suspend, etc.  Forcing
kvm_usage_count would work, but it would essentially do away with "graceful"
reboots, i.e. reboots where the host isn't running VMs and thus VMX is already
disabled.  Having VMX be enabled so long as KVM is loaded would turn all
reboots into the "oh crap, the system is rebooting, quick do VMXOFF!" variety.

> That doesn't seem too insane.  This is yet *ANOTHER* reason that doing
> dynamic TDX module initialization is a good idea.
> 
> >> It's not wrong to say that TDX is a KVM user.  If KVM wants
> >> 'kvm_usage_count' to go back to 0, it can shut down the TDX module.  Then
> >> there's no PAMT to worry about.
> >>
> >> The shutdown would be something like:
> >>
> >>       1. TDX module shutdown
> >>       2. Deallocate/Convert PAMT
> >>       3. vmxoff
> >>
> >> Then, no SEAMCALL failure because of vmxoff can cause a PAMT-induced #MC
> >> to be missed.
> > 
> > The limitation is that once the TDX module is shut down, it cannot be
> > initialized again unless it is updated at runtime.
> > 
> > In the long term, if we go with this design then there might be other
> > problems when other kernel components are using TDX.  For example, the VT-d
> > driver will need to be changed to support TDX-IO, and it will need to enable
> > the TDX module much earlier than KVM to do some initialization.  It might
> > need to do some TDX work (e.g., cleanup) while KVM is unloaded.  I am not
> > super familiar with TDX-IO, but it looks like we might have a problem here
> > if we go with such a design.
> 
> The burden for who does vmxon will simply need to change from KVM itself
> to some common code that KVM depends on.  Probably not dissimilar to
> those nutty (sorry folks, just calling it as I see 'em) multi-KVM module

You misspelled "amazing" ;-)

> patches that are floating around.

Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
or do all that much initialization or teardown.

TDX on the other hand is quite different.  IIRC the PAMT is hundreds of MiB, maybe
over a GiB in most expected use cases?  And also IIRC, TDH.SYS.INIT is a
rather long-running operation that blocks IRQs, NMIs, (SMIs?), etc.

So rather than shove TDX ownership into KVM and force KVM to figure out how to
manage the TDX module, why not do what us nutty people are suggesting and move
hardware enabling and TDX-module management into a dedicated base module (bonus
points if you call it vac.ko ;-) ).

Alternatively, we could have a dedicated kernel module for TDX, e.g. tdx.ko, and
then have tdx.ko and kvm.ko depend on vac.ko.  But I think that ends up being
quite gross and unnecessary, e.g. in such a setup, kvm-intel.ko ideally wouldn't
take a hard dependency on tdx.ko, as auto-loading tdx.ko would defeat some of the
purpose of the split, and KVM shouldn't fail to load just because TDX isn't supported.
But that'd mean conditionally doing request_module("tdx") or whatever and would
create other conundrums.

(Oof, typing that out made me realize that KVM depends on the PSP driver if
CONFIG_KVM_AMD_SEV=y, even if if the platform owner has no intention of ever using
SEV/SEV-ES.  IIUC, it works because sp_mod_init() just registers a driver, i.e.
doesn't fail out if there's no PSP.  That's kinda gross).

Anyways, vac.ko provides an API to grab a reference to the TDX module, e.g. the
"create a VM" API gets extended to say "create a VM of the TDX variety", and then
vac.ko manages its refcounts to VMX and TDX accordingly.  And KVM obviously keeps
its existing behavior of getting and putting references for each VM.

That way userspace gets to decide when to (un)load tdx.ko without needing to add
a KVM module param or whatever to allow forcefully unloading tdx.ko (which would
be bizarre and probably quite difficult to implement correctly), and unloading
kvm-intel.ko wouldn't require unloading the TDX module.

The end behavior might not be all that different in the short term, but it would
give us more options, e.g. for this erratum, it would be quite easy for vac.ko to
let usersepace choose between keeping VMX "on" (while the TDX module is loaded)
and potentially having imperfect #MC messages.

And out-of-tree hypervisors could even use vac.ko's exported APIs to manage hardware
enabling if they so choose.
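
A sketch of the refcounting side of that (everything below, including
the names, is hypothetical):

static DEFINE_MUTEX(vac_lock);
static int vac_tdx_users;

int vac_get_tdx(void)
{
	int r = 0;

	mutex_lock(&vac_lock);
	if (!vac_tdx_users) {
		r = vac_vmxon_all_cpus();
		if (!r) {
			r = tdx_enable();
			if (r)
				vac_vmxoff_all_cpus();
		}
	}
	if (!r)
		vac_tdx_users++;
	mutex_unlock(&vac_lock);
	return r;
}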
Borislav Petkov Dec. 5, 2023, 2:25 p.m. UTC | #12
On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> +static const char *mce_memory_info(struct mce *m)
> +{
> +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> +		return NULL;
> +
> +	/*
> +	 * Certain initial generations of TDX-capable CPUs have an
> +	 * erratum.  A kernel non-temporal partial write to TDX private
> +	 * memory poisons that memory, and a subsequent read of that
> +	 * memory triggers #MC.
> +	 *
> +	 * However such #MC caused by software cannot be distinguished
> +	 * from the real hardware #MC.  Just print additional message
> +	 * to show such #MC may be result of the CPU erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return NULL;
> +
> +	return !tdx_is_private_mem(m->addr) ? NULL :
> +		"TDX private memory error. Possible kernel bug.";
> +}
> +
>  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  {
>  	struct llist_node *pending;
>  	struct mce_evt_llist *l;
>  	int apei_err = 0;
> +	const char *memmsg;
>  
>  	/*
>  	 * Allow instrumentation around external facilities usage. Not that it
> @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	}
>  	if (exp)
>  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> +	/*
> +	 * Confidential computing platforms such as TDX platforms
> +	 * may encounter MCEs due to incorrect accesses to confidential
> +	 * memory.  Print additional information for such errors.
> +	 */
> +	memmsg = mce_memory_info(final);
> +	if (memmsg)
> +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> +

No, this is not how this is done. First of all, this function should be
called something like

	mce_dump_aux_info()

or so to state that it is dumping some auxiliary info.

Then, it does:

	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return tdx_get_mce_info();

or so and you put that tdx_get_mce_info() function in TDX code and there
you do all your picking apart of things, what needs to be dumped or what
not, checking whether it is a memory error and so on.

Thx.
Dave Hansen Dec. 5, 2023, 4:36 p.m. UTC | #13
On 12/4/23 18:04, Sean Christopherson wrote:
> Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
> a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
> in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
> or do all that much initialization or teardown.

Yeah, you have a good point there.  We really do need some core code to
manage VMXON/OFF now that there is increased interest outside of
_purely_ running VMs.

For the purposes of _this_ patch, I think I'm happy to leave open the
possibility that SEAMCALL can simply fail due to VMXOFF.  For now, it
means that we can't attribute #MC's to the PAMT unless a VM is running
but that seems like a reasonable compromise for the moment.

Once TDX gains the ability to "pin" VMXON, the added precision here will
be appreciated.
Tony Luck Dec. 5, 2023, 4:36 p.m. UTC | #14
>> Fine.  This doesn't need to change ... until you load TDX.  Once you
>> initialize the TDX module, no more out-of-tree VMMs for you.
>
> It's not just out-of-tree hypervisors, which IMO should be little more than an
> afterthought.  The other more important issue is that being post-VMXON blocks INIT,

Does that make CPU offline a one-way process? Linux uses INIT to bring a CPU back
online again.

-Tony
Sean Christopherson Dec. 5, 2023, 4:53 p.m. UTC | #15
On Tue, Dec 05, 2023, Dave Hansen wrote:
> On 12/4/23 18:04, Sean Christopherson wrote:
> > Joking aside, why shove TDX module ownership into KVM?  It honestly sounds like
> > a terrible fit, even without the whole TDX-IO mess.  KVM state is largely ephemeral,
> > in the sense that loading and unloading kvm.ko doesn't allocate/free much memory
> > or do all that much initialization or teardown.
> 
> Yeah, you have a good point there.  We really do need some core code to
> manage VMXON/OFF now that there is increased interest outside of
> _purely_ running VMs.
> 
> For the purposes of _this_ patch, I think I'm happy to leave open the
> possibility that SEAMCALL can simply fail due to VMXOFF.  For now, it
> means that we can't attribute #MC's to the PAMT unless a VM is running
> but that seems like a reasonable compromise for the moment.

+1

> Once TDX gains the ability to "pin" VMXON, the added precision here will
> be appreciated.
Sean Christopherson Dec. 5, 2023, 4:57 p.m. UTC | #16
On Tue, Dec 05, 2023, Tony Luck wrote:
> >> Fine.  This doesn't need to change ... until you load TDX.  Once you
> >> initialize the TDX module, no more out-of-tree VMMs for you.
> >
> > It's not just out-of-tree hypervisors, which IMO should be little more than an
> > afterthought.  The other more important issue is that being post-VMXON blocks INIT,
> 
> Does that make CPU offline a one-way process? Linux uses INIT to bring a CPU back
> online again.

No, KVM does VMXOFF on the CPU being offlined, and then VMXON if/when the CPU is
onlined again.  This also handles secondary CPUs for suspend/resume (the primary
CPU hooks .suspend() and .resume()).

static int kvm_offline_cpu(unsigned int cpu)
{
	mutex_lock(&kvm_lock);
	if (kvm_usage_count)
		hardware_disable_nolock(NULL);
	mutex_unlock(&kvm_lock);
	return 0;
}


static int kvm_online_cpu(unsigned int cpu)
{
	int ret = 0;

	/*
	 * Abort the CPU online process if hardware virtualization cannot
	 * be enabled. Otherwise running VMs would encounter unrecoverable
	 * errors when scheduled to this CPU.
	 */
	mutex_lock(&kvm_lock);
	if (kvm_usage_count)
		ret = __hardware_enable_nolock();
	mutex_unlock(&kvm_lock);
	return ret;
}
Huang, Kai Dec. 5, 2023, 7:41 p.m. UTC | #17
On Tue, 2023-12-05 at 15:25 +0100, Borislav Petkov wrote:
> On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> > +static const char *mce_memory_info(struct mce *m)
> > +{
> > +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> > +		return NULL;
> > +
> > +	/*
> > +	 * Certain initial generations of TDX-capable CPUs have an
> > +	 * erratum.  A kernel non-temporal partial write to TDX private
> > +	 * memory poisons that memory, and a subsequent read of that
> > +	 * memory triggers #MC.
> > +	 *
> > +	 * However such #MC caused by software cannot be distinguished
> > +	 * from the real hardware #MC.  Just print additional message
> > +	 * to show such #MC may be result of the CPU erratum.
> > +	 */
> > +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > +		return NULL;
> > +
> > +	return !tdx_is_private_mem(m->addr) ? NULL :
> > +		"TDX private memory error. Possible kernel bug.";
> > +}
> > +
> >  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> >  {
> >  	struct llist_node *pending;
> >  	struct mce_evt_llist *l;
> >  	int apei_err = 0;
> > +	const char *memmsg;
> >  
> >  	/*
> >  	 * Allow instrumentation around external facilities usage. Not that it
> > @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
> >  	}
> >  	if (exp)
> >  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> > +	/*
> > +	 * Confidential computing platforms such as TDX platforms
> > +	 * may encounter MCEs due to incorrect accesses to confidential
> > +	 * memory.  Print additional information for such errors.
> > +	 */
> > +	memmsg = mce_memory_info(final);
> > +	if (memmsg)
> > +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> > +
> 
> No, this is not how this is done. First of all, this function should be
> called something like
> 
> 	mce_dump_aux_info()
> 
> or so to state that it is dumping some auxiliary info.
> 
> Then, it does:
> 
> 	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 		return tdx_get_mce_info();
> 
> or so and you put that tdx_get_mce_info() function in TDX code and there
> you do all your picking apart of things, what needs to be dumped or what
> not, checking whether it is a memory error and so on.
> 
> Thx.
> 

Thanks Boris.  Looks good to me, with one exception: this is actually the TDX
host, not a TDX guest, so the above X86_FEATURE_TDX_GUEST CPU feature check
needs to be replaced with a host-side one.

Full incremental diff below.  Could you take a look at whether this is what
you want?

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a621721f63dd..0c02b66dcc41 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -115,13 +115,13 @@ bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 void tdx_reset_memory(void);
-bool tdx_is_private_mem(unsigned long phys);
+const char *tdx_get_mce_info(unsigned long phys);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline void tdx_reset_memory(void) { }
-static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
+static inline const char *tdx_get_mce_info(unsigned long phys) { return NULL; }
 #endif /* CONFIG_INTEL_TDX_HOST */

 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e33537cfc507..b7e650b5f7ef 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -229,26 +229,20 @@ static void wait_for_panic(void)
        panic("Panicing machine check CPU died");
 }

-static const char *mce_memory_info(struct mce *m)
+static const char *mce_dump_aux_info(struct mce *m)
 {
-       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
-               return NULL;
-
        /*
-        * Certain initial generations of TDX-capable CPUs have an
-        * erratum.  A kernel non-temporal partial write to TDX private
-        * memory poisons that memory, and a subsequent read of that
-        * memory triggers #MC.
-        *
-        * However such #MC caused by software cannot be distinguished
-        * from the real hardware #MC.  Just print additional message
-        * to show such #MC may be result of the CPU erratum.
+        * Confidential computing platforms such as TDX platforms
+        * may encounter MCEs due to incorrect accesses to confidential
+        * memory.  Print additional information for such errors.
         */
-       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
                return NULL;

-       return !tdx_is_private_mem(m->addr) ? NULL :
-               "TDX private memory error. Possible kernel bug.";
+       if (platform_tdx_enabled())
+               return tdx_get_mce_info(m->addr);
+
+       return NULL;
 }

 static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
@@ -256,7 +250,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
        struct llist_node *pending;
        struct mce_evt_llist *l;
        int apei_err = 0;
-       const char *memmsg;
+       const char *auxinfo;

        /*
         * Allow instrumentation around external facilities usage. Not that it
@@ -307,14 +301,10 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
        }
        if (exp)
                pr_emerg(HW_ERR "Machine check: %s\n", exp);
-       /*
-        * Confidential computing platforms such as TDX platforms
-        * may encounter MCEs due to incorrect accesses to confidential
-        * memory.  Print additional information for such errors.
-        */
-       memmsg = mce_memory_info(final);
-       if (memmsg)
-               pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
+       auxinfo = mce_dump_aux_info(final);
+       if (auxinfo)
+               pr_emerg(HW_ERR "Machine check: %s\n", auxinfo);

        if (!fake_panic) {
                if (panic_timeout == 0)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b84dcdf63cb..cfbaec0f43b2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1329,7 +1329,7 @@ static bool is_pamt_page(unsigned long phys)
  * because it cannot distinguish #MC between software bug and real
  * hardware error anyway.
  */
-bool tdx_is_private_mem(unsigned long phys)
+static bool tdx_is_private_mem(unsigned long phys)
 {
        struct tdx_module_args args = {
                .rcx = phys & PAGE_MASK,
@@ -1391,6 +1391,25 @@ bool tdx_is_private_mem(unsigned long phys)
        }
 }

+const char *tdx_get_mce_info(unsigned long phys)
+{
+       /*
+        * Certain initial generations of TDX-capable CPUs have an
+        * erratum.  A kernel non-temporal partial write to TDX private
+        * memory poisons that memory, and a subsequent read of that
+        * memory triggers #MC.
+        *
+        * However such #MC caused by software cannot be distinguished
+        * from the real hardware #MC.  Just print additional message
+        * to show such #MC may be result of the CPU erratum.
+        */
+       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+               return NULL;
+
+       return !tdx_is_private_mem(phys) ? NULL :
+               "TDX private memory error. Possible kernel bug.";
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
                                            u32 *nr_tdx_keyids)
 {
Borislav Petkov Dec. 5, 2023, 7:56 p.m. UTC | #18
On Tue, Dec 05, 2023 at 07:41:41PM +0000, Huang, Kai wrote:
> -static const char *mce_memory_info(struct mce *m)
> +static const char *mce_dump_aux_info(struct mce *m)
>  {
> -       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> -               return NULL;
> -
>         /*
> -        * Certain initial generations of TDX-capable CPUs have an
> -        * erratum.  A kernel non-temporal partial write to TDX private
> -        * memory poisons that memory, and a subsequent read of that
> -        * memory triggers #MC.
> -        *
> -        * However such #MC caused by software cannot be distinguished
> -        * from the real hardware #MC.  Just print additional message
> -        * to show such #MC may be result of the CPU erratum.
> +        * Confidential computing platforms such as TDX platforms
> +        * may encounter MCEs due to incorrect accesses to confidential
> +        * memory.  Print additional information for such errors.
>          */
> -       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
>                 return NULL;

What's the point of doing this on !TDX? None.

> -       return !tdx_is_private_mem(m->addr) ? NULL :
> -               "TDX private memory error. Possible kernel bug.";
> +       if (platform_tdx_enabled())

So is this the "host is TDX" check?

Not a X86_FEATURE flag but something homegrown. And Kirill is trying to
switch the CC_ATTRs to X86_FEATURE_ flags for SEV but here you guys are
using something homegrown.

why not a X86_FEATURE_ flag?

The CC_ATTR things are for guests, I guess, but the host feature checks
should be X86_FEATURE_ flags things.

Hmmm.
Huang, Kai Dec. 5, 2023, 8:08 p.m. UTC | #19
On Tue, 2023-12-05 at 20:56 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 07:41:41PM +0000, Huang, Kai wrote:
> > -static const char *mce_memory_info(struct mce *m)
> > +static const char *mce_dump_aux_info(struct mce *m)
> >  {
> > -       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> > -               return NULL;
> > -
> >         /*
> > -        * Certain initial generations of TDX-capable CPUs have an
> > -        * erratum.  A kernel non-temporal partial write to TDX private
> > -        * memory poisons that memory, and a subsequent read of that
> > -        * memory triggers #MC.
> > -        *
> > -        * However such #MC caused by software cannot be distinguished
> > -        * from the real hardware #MC.  Just print additional message
> > -        * to show such #MC may be result of the CPU erratum.
> > +        * Confidential computing platforms such as TDX platforms
> > +        * may encounter MCEs due to incorrect accesses to confidential
> > +        * memory.  Print additional information for such errors.
> >          */
> > -       if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > +       if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> >                 return NULL;
> 
> What's the point of doing this on !TDX? None.

OK. I'll move this inside tdx_get_mce_info(). 

> 
> > -       return !tdx_is_private_mem(m->addr) ? NULL :
> > -               "TDX private memory error. Possible kernel bug.";
> > +       if (platform_tdx_enabled())
> 
> So is this the "host is TDX" check?
> 
> Not a X86_FEATURE flag but something homegrown. And Kirill is trying to
> switch the CC_ATTRs to X86_FEATURE_ flags for SEV but here you guys are
> using something homegrown.
> 
> why not a X86_FEATURE_ flag?
> 

The difference is that for the TDX host the kernel needs to initialize the
TDX module before TDX can be used.  The module initialization is done at
runtime, and platform_tdx_enabled() here only returns whether the BIOS has
enabled TDX.

IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
understanding the flag being present means the kernel has done some enabling
work and the feature is ready to use.
Borislav Petkov Dec. 5, 2023, 8:29 p.m. UTC | #20
On Tue, Dec 05, 2023 at 08:08:34PM +0000, Huang, Kai wrote:
> The difference is that for the TDX host the kernel needs to initialize the
> TDX module before TDX can be used.  The module initialization is done at
> runtime, and platform_tdx_enabled() here only returns whether the BIOS has
> enabled TDX.
> 
> IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
> understanding the flag being present means the kernel has done some enabling
> work and the feature is ready to use.

Which flag do you mean? X86_FEATURE_TDX_GUEST?

I mean, you would set a separate X86_FEATURE_TDX or so flag to denote
that the BIOS has enabled it, at the end of that tdx_init() in the first
patch.
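
I.e., something like this sketch (the flag name is made up;
setup_force_cpu_cap() is the usual way to set a synthetic bit):

/* In tdx_init(), once BIOS-side enabling is confirmed: */
setup_force_cpu_cap(X86_FEATURE_TDX_HOST);

/* Callers then use the ordinary feature test: */
if (!boot_cpu_has(X86_FEATURE_TDX_HOST))
	return NULL;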
Huang, Kai Dec. 5, 2023, 8:33 p.m. UTC | #21
On Tue, 2023-12-05 at 21:29 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:08:34PM +0000, Huang, Kai wrote:
> > The difference is that for the TDX host the kernel needs to initialize the
> > TDX module before TDX can be used.  The module initialization is done at
> > runtime, and platform_tdx_enabled() here only returns whether the BIOS has
> > enabled TDX.
> > 
> > IIUC an X86_FEATURE_ flag doesn't suit this purpose, because based on my
> > understanding the flag being present means the kernel has done some enabling
> > work and the feature is ready to use.
> 
> Which flag do you mean? X86_FEATURE_TDX_GUEST?
> 
> I mean, you would set a separate X86_FEATURE_TDX or so flag to denote
> that the BIOS has enabled it, at the end of that tdx_init() in the first
> patch.
> 

Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
fit, because when it is set the kernel hasn't actually done any enabling
work yet, so TDX is not available even though the flag is set.
Borislav Petkov Dec. 5, 2023, 8:41 p.m. UTC | #22
On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
> Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
> fit, because when it is set the kernel hasn't actually done any enabling
> work yet, so TDX is not available even though the flag is set.

You define a X86_FEATURE flag. You set it *when* TDX is available and
enabled. Then you query that flag. This is how synthetic flags work.
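
Something like this (a sketch; the flag name and the bit are made up):

  /* arch/x86/include/asm/cpufeatures.h */
  #define X86_FEATURE_TDX_HOST    (/* some free synthetic bit */)

  /* set it once, at the place TDX is known to be usable: */
  setup_force_cpu_cap(X86_FEATURE_TDX_HOST);

  /* query sites then simply do: */
  if (!cpu_feature_enabled(X86_FEATURE_TDX_HOST))
          return NULL;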

In your patchset, when do you know that TDX is enabled? Point me to the
code place pls.
Dave Hansen Dec. 5, 2023, 8:49 p.m. UTC | #23
On 12/5/23 12:41, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
>> Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
>> fit, because when it is set the kernel hasn't actually done any enabling
>> work yet, so TDX is not available even though the flag is set.
> You define a X86_FEATURE flag. You set it *when* TDX is available and
> enabled. Then you query that flag. This is how synthetic flags work.
> 
> In your patchset, when do you know that TDX is enabled? Point me to the
> code place pls.

TDX can be "ready" in a couple of different ways:

1. The module is there and running SEAMCALLs (platform_tdx_enabled())
2. The module is initialized and ready to run guests.  This happens
   after init_tdmrs() and init_tdx_module() return success.

#1 is known at boot.
#2 doesn't happen until just before KVM runs the first TDX guest.
Here's the patch for #2:

> https://lore.kernel.org/all/566ff8b05090c935d980d5ace3389d31c7cce7df.1699527082.git.kai.huang@intel.com/
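
In code terms, roughly (a sketch; tdx_ready_for_guests() is a made-up
name, and tdx_module_status is internal to tdx.c in this series):

  static bool tdx_ready_for_guests(void)
  {
          /* #1: BIOS enabled TDX, so SEAMCALLs are possible at all... */
          if (!platform_tdx_enabled())
                  return false;

          /* ...#2: and the module itself has been fully initialized. */
          return tdx_module_status == TDX_MODULE_INITIALIZED;
  }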
Huang, Kai Dec. 5, 2023, 8:58 p.m. UTC | #24
On Tue, 2023-12-05 at 21:41 +0100, Borislav Petkov wrote:
> On Tue, Dec 05, 2023 at 08:33:14PM +0000, Huang, Kai wrote:
> > Yes, I understand what you said.  My point is that X86_FEATURE_TDX doesn't
> > fit, because when it is set the kernel hasn't actually done any enabling
> > work yet, so TDX is not available even though the flag is set.
> 
> You define a X86_FEATURE flag. You set it *when* TDX is available and
> enabled. Then you query that flag. This is how synthetic flags work.
> 
> In your patchset, when do you know that TDX is enabled? Point me to the
> code place pls.
> 

This patchset provides two functions to allow the user of TDX to enable TDX at
runtime when needed: tdx_cpu_enable() and tdx_enable().

Please see patch:

https://lore.kernel.org/lkml/cover.1699527082.git.kai.huang@intel.com/T/#m96cb9aaa4e323d4e29f7ff6c532f7d33a01995a7

So TDX will be available once tdx_enable() has completed successfully.

For now KVM is the only user of TDX, and tdx_enable() will be called by KVM on
demand at runtime.
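
The expected flow on the KVM side roughly looks like below (a sketch
only; kvm_init_tdx() is a made-up name and error handling is trimmed):

  static int kvm_init_tdx(void)
  {
          int ret;

          /*
           * tdx_cpu_enable() is per-cpu and must be run on each
           * online cpu with VMXON done, e.g. from KVM's per-cpu
           * hardware-enable path; shown once here for brevity.
           */
          ret = tdx_cpu_enable();
          if (ret)
                  return ret;

          /* One-shot, system-wide TDX module initialization. */
          return tdx_enable();
  }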

Hope I've made this clear.  Thanks.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index caca139e7022..a621721f63dd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -115,11 +115,13 @@  bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 void tdx_reset_memory(void);
+bool tdx_is_private_mem(unsigned long phys);
 #else
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline void tdx_reset_memory(void) { }
+static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7b397370b4d6..e33537cfc507 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@ 
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/tdx.h>
 
 #include "internal.h"
 
@@ -228,11 +229,34 @@  static void wait_for_panic(void)
 	panic("Panicing machine check CPU died");
 }
 
+static const char *mce_memory_info(struct mce *m)
+{
+	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+		return NULL;
+
+	/*
+	 * Certain initial generations of TDX-capable CPUs have an
+	 * erratum.  A kernel non-temporal partial write to TDX private
+	 * memory poisons that memory, and a subsequent read of that
+	 * memory triggers #MC.
+	 *
+	 * However, such a software-caused #MC cannot be distinguished
+	 * from a real hardware #MC.  Just print an additional message
+	 * to note that such an #MC may be a result of the CPU erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return NULL;
+
+	return !tdx_is_private_mem(m->addr) ? NULL :
+		"TDX private memory error. Possible kernel bug.";
+}
+
 static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 {
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	const char *memmsg;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -283,6 +307,15 @@  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	}
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
+	/*
+	 * Confidential computing platforms such as TDX may generate
+	 * MCEs due to incorrect access to confidential memory.
+	 * Print additional information for such errors.
+	 */
+	memmsg = mce_memory_info(final);
+	if (memmsg)
+		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cc21a0f25bee..1b84dcdf63cb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1288,6 +1288,109 @@  void tdx_reset_memory(void)
 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
 }
 
+static bool is_pamt_page(unsigned long phys)
+{
+	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+	int i;
+
+	/*
+	 * This function is called from the #MC handler, and in theory
+	 * it could run in parallel with the TDX module initialization
+	 * on other logical cpus.  But it's not OK to hold a mutex
+	 * here, so just blindly check the module status to make sure
+	 * PAMTs/TDMRs are stable to access.
+	 *
+	 * This may return an inaccurate result in rare cases, e.g.,
+	 * when an #MC happens on a PAMT page during module
+	 * initialization, but this is fine as the #MC handler doesn't
+	 * need a 100% accurate result.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return false;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+		if (phys >= base && phys < (base + size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not.  Called from the #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish an #MC caused by a software bug from a
+ * real hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
+{
+	struct tdx_module_args args = {
+		.rcx = phys & PAGE_MASK,
+	};
+	u64 sret;
+
+	if (!platform_tdx_enabled())
+		return false;
+
+	/* Get page type from the TDX module */
+	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
+	/*
+	 * Handle the case where the CPU isn't in VMX operation.
+	 *
+	 * KVM guarantees no VM is running (thus no TDX guest)
+	 * when any online CPU isn't in VMX operation.  This means
+	 * there will be no TDX guest private memory or Secure-EPT
+	 * pages.  However the TDX module may have been initialized
+	 * and the memory page could be a PAMT page.
+	 */
+	if (sret == TDX_SEAMCALL_UD)
+		return is_pamt_page(phys);
+
+	/*
+	 * Any other failure means:
+	 *
+	 * 1) TDX module not loaded; or
+	 * 2) Memory page isn't managed by the TDX module.
+	 *
+	 * In either case, the memory page cannot be a TDX
+	 * private page.
+	 */
+	if (sret)
+		return false;
+
+	/*
+	 * SEAMCALL was successful -- read page type (via RCX):
+	 *
+	 *  - PT_NDA:	Page is not used by the TDX module
+	 *  - PT_RSVD:	Reserved for Non-TDX use
+	 *  - Others:	Page is used by the TDX module
+	 *
+	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
+	 * private memory.
+	 *
+	 * Note: Even if the page type is PT_NDA, the memory page could
+	 * still be associated with a TDX private KeyID if the kernel
+	 * hasn't explicitly used MOVDIR64B to clear the page.  Assume
+	 * KVM always does that after reclaiming any private page from
+	 * TDX guests.
+	 */
+	switch (args.rcx) {
+	case PT_NDA:
+		return false;
+	case PT_RSVD:
+		return is_pamt_page(phys);
+	default:
+		return true;
+	}
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index c0610f0bb88c..b701f69485d3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -14,6 +14,7 @@ 
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_PHYMEM_PAGE_RDMD	24
 #define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INIT		33
 #define TDH_SYS_RD		34
@@ -21,6 +22,10 @@ 
 #define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_CONFIG		45
 
+/* TDX page types */
+#define	PT_NDA		0x0
+#define	PT_RSVD		0x1
+
 /*
  * Global scope metadata field ID.
  *