
[v8,07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

Message ID 8aab33a7db7a408beb403950e21f693b0b0f1f2b.1670566861.git.kai.huang@intel.com (mailing list archive)
State New
Series TDX host kernel support

Commit Message

Kai Huang Dec. 9, 2022, 6:52 a.m. UTC
As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Region" (CMR) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  Once this is done, those "TDX-usable" memory regions
are fixed during module's lifetime.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator.  To keep things simple, just make sure
all pages in the page allocator are TDX memory.

To guarantee that, stash off the memblock memory regions at the time of
initializing the TDX module as TDX's own usable memory regions, and in
the meantime, register a TDX memory notifier to reject to online any new
memory in memory hotplug.

This approach works as in practice all boot-time present DIMMs are TDX
convertible memory.  However, if any non-TDX-convertible memory has been
hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
module, the module initialization will fail.

This can also be enhanced in the future, i.e. by allowing adding non-TDX
memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
needs to guarantee memory pages for TDX guests are always allocated from
the "TDX-capable" nodes.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Trimmed down changelog (Dave).
 - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
   (Ying).
 - Moved memory hotplug handling from arch_add_memory() to the
   memory notifier (Dan/David).
 - Moved 'nid' of 'struct tdx_memblock' to a later patch (Dave).
 - {build|free}_tdx_memory() -> {build|free}_tdx_memlist() (Dave).
 - Removed pfn_covered_by_cmr() check as there is no code to trim CMRs now.
 - Improved the comment around the first 1MB (Dave).
 - Added a comment around reserve_real_mode() to point out TDX code
   relies on the first 1MB being reserved (Ying).
 - Added comment to explain why the new online memory range cannot
   cross multiple TDX memory blocks (Dave).
 - Improved other comments (Dave).

---
 arch/x86/Kconfig            |   1 +
 arch/x86/kernel/setup.c     |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
 3 files changed, 162 insertions(+), 1 deletion(-)

Comments

Dave Hansen Jan. 6, 2023, 6:18 p.m. UTC | #1
On 12/8/22 22:52, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
> 
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible.  The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module.  Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, just make sure
> all pages in the page allocator are TDX memory.

It's hard to tell what "The initial support of TDX guests" means.  I
*think* you mean "this series".  But, we try not to say "this blah" too
much, so just say this:

	To keep things simple, assume that all TDX-protected memory will
	come from the page allocator.  Make sure all pages in the page allocator
	*are* TDX-usable memory.

> To guarantee that, stash off the memblock memory regions at the time of
> initializing the TDX module as TDX's own usable memory regions, and in
> the meantime, register a TDX memory notifier to reject to online any new
> memory in memory hotplug.

First, this is a run-on sentence.  Second, it isn't really clear what
memblocks have to do with this or why you need to stash them off.
Please explain.

> This approach works as in practice all boot-time present DIMMs are TDX
> convertible memory.  However, if any non-TDX-convertible memory has been
> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> module, the module initialization will fail.

I really don't know what this is trying to say.

*How* and *why* does this module initialization failure occur?  How do
you implement it and why is it necessary?

> This can also be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.

Why does it need to be enhanced?  What's the problem?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 216fee7144ee..3a841a77fda4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
>  	 *
>  	 * Moreover, on machines with SandyBridge graphics or in setups that use
>  	 * crashkernel the entire 1M is reserved anyway.
> +	 *
> +	 * Note the host kernel TDX also requires the first 1MB to be reserved.
>  	 */
>  	reserve_real_mode();
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6fe505c32599..f010402f443d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,13 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/memory.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> +#include <linux/pfn.h>
>  #include <asm/pgtable_types.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -25,6 +32,12 @@ enum tdx_module_status_t {
>  	TDX_MODULE_ERROR
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};
> +
>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 nr_tdx_keyids __ro_after_init;
>  
> @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
>  /* Prevent concurrent attempts on TDX detection and initialization */
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
>  	return 0;
>  }
>  
> +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled. */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered by
> +		 * any TDX memory block.
> +		 *
> +		 * Note TDX memory blocks are originated from memblock
> +		 * memory regions, which can only be contiguous when two
> +		 * regions have different NUMA nodes or flags.  Therefore
> +		 * the new range cannot cross multiple TDX memory blocks.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}

I don't really like that comment.  It should first state its behavior
and assumptions, like:

	This check assumes that the start_pfn<->end_pfn range does not
	cross multiple tdx_memlist entries.

Only then should it describe why that is OK:

	A single memory hotplug event across multiple memblocks (from
	which tdx_memlist entries are derived) is impossible.  ... then
	actually explain



> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * Not all memory is compatible with TDX.  Reject
> +	 * to online any incompatible memory.
> +	 */

This comment isn't quite right either.  There might actually be totally
TDX *compatible* memory here.  It just wasn't configured for use with TDX.

Shouldn't this be something more like:

	/*
	 * The TDX memory configuration is static and can not be
	 * changed.  Reject onlining any memory which is outside
	 * of the static configuration whether it supports TDX or not.
	 */

> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +	.notifier_call = tdx_memory_notifier,
> +};
> +
>  static int __init tdx_init(void)
>  {
>  	int err;
> @@ -89,6 +149,13 @@ static int __init tdx_init(void)
>  		goto no_tdx;
>  	}
>  
> +	err = register_memory_notifier(&tdx_memory_nb);
> +	if (err) {
> +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> +				err);
> +		goto no_tdx;
> +	}
> +
>  	return 0;
>  no_tdx:
>  	clear_tdx();
> @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>  	return 0;
>  }
>  
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}

'tdx_memlist' is written only once at boot and then is read-only, right?

It might be nice to mention that so that the lack of locking doesn't
look problematic.

> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */

Ahh, that's why we need to "shadow" the memblocks.  Can you add a
sentence on this to the changelog, please?

> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * to the page allocator, it is still in memblock's memory
> +		 * in the page allocator, it is still in memblock's memory
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock are already guaranteed to be in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -226,10 +364,25 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, just make
> +	 * sure all pages in the page allocator are TDX memory.

I didn't like this in the changelog either.  Try to make this "timeless"
rather than refer to what the support is today.  I gave you example text
above.

> +	 * Build the list of "TDX-usable" memory regions which cover all
> +	 * pages in the page allocator to guarantee that.  Do it while
> +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> +	 * path reads the @tdx_memlist to reject any new memory.
> +	 */
> +	get_online_mems();

Oh, it actually uses the memory hotplug locking for list protection.
That's at least a bit subtle.  Please document that somewhere in the
functions that actually manipulate the list.

I think it's also worth saying something here about the high-level
effects of what's going on:

	Take a snapshot of the memory configuration (memblocks).  This
	snapshot will be used to enable TDX support for *this* memory
	configuration only.  Use a memory hotplug notifier to ensure
	that no other RAM can be added outside of this configuration.

That's it, right?

> +	ret = build_tdx_memlist(&tdx_memlist);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>  	 *    regions.
>  	 *  - Pick up one TDX private KeyID as the global KeyID.
> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
>  	return ret;
>  }

You would also be wise to have the folks who do a lot of memory hotplug
work today look at this sooner rather than later.  I _think_ what you
have here is OK, but I'm really rusty on the code itself.
Kai Huang Jan. 9, 2023, 11:48 a.m. UTC | #2
On Fri, 2023-01-06 at 10:18 -0800, Dave Hansen wrote:
> > On 12/8/22 22:52, Kai Huang wrote:
> > > > As a step of initializing the TDX module, the kernel needs to tell the
> > > > TDX module which memory regions can be used by the TDX module as TDX
> > > > guest memory.
> > > > 
> > > > TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> > > > kernel which memory is TDX compatible.  The kernel needs to build a list
> > > > of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> > > > the TDX module.  Once this is done, those "TDX-usable" memory regions
> > > > are fixed during module's lifetime.
> > > > 
> > > > The initial support of TDX guests will only allocate TDX guest memory
> > > > from the global page allocator.  To keep things simple, just make sure
> > > > all pages in the page allocator are TDX memory.
> > 
> > It's hard to tell what "The initial support of TDX guests" means.  I
> > *think* you mean "this series".  But, we try not to say "this blah" too
> > much, so just say this:
> > 
> > 	To keep things simple, assume that all TDX-protected memory will
> > 	come from the page allocator.  Make sure all pages in the page allocator
> > 	*are* TDX-usable memory.
> > 

Yes will do. Thanks.

> > > > To guarantee that, stash off the memblock memory regions at the time of
> > > > initializing the TDX module as TDX's own usable memory regions, and in
> > > > the meantime, register a TDX memory notifier to reject to online any new
> > > > memory in memory hotplug.
> > 
> > First, this is a run-on sentence.  Second, it isn't really clear what
> > memblocks have to do with this or why you need to stash them off.
> > Please explain.

Will break up the run-on sentence. And as you mentioned below, will add a
sentence to explain why we need to stash off the memblocks.

> > 
> > > > This approach works as in practice all boot-time present DIMMs are TDX
> > > > convertible memory.  However, if any non-TDX-convertible memory has been
> > > > hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> > > > module, the module initialization will fail.
> > 
> > I really don't know what this is trying to say.

My intention is to explain and call out that such a design (use all memory regions
in memblock at the time of module initialization) works in practice, as long as
non-CMR memory hasn't been added via memory hotplug.

Not sure if it is necessary, but I was thinking it may help reviewers judge
whether such a design is acceptable.

> > 
> > *How* and *why* does this module initialization failure occur?  
> > 

If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
will fail.
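
Roughly, the failure would look like the below.  Note this is only a hedged
sketch to illustrate the failure mode: the seamcall() wrapper, the
TDH_SYS_CONFIG leaf and the TDMR array layout are assumptions on my side,
since this patch stops before the TDH.SYS.CONFIG step:

	/*
	 * Hypothetical sketch: configure the TDX module with the array of
	 * TDMRs built from the TDX-usable memory list.  If any TDMR covers
	 * memory outside the CMRs, the TDX module rejects the whole
	 * configuration and module initialization fails.
	 */
	static int config_tdx_module(u64 *tdmr_pa_array, int nr_tdmrs,
				     u64 global_keyid)
	{
		int ret;

		ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), nr_tdmrs,
			       global_keyid, 0, NULL, NULL);
		if (ret)
			pr_err("TDH.SYS.CONFIG failed (%d): non-CMR memory?\n",
					ret);
		return ret;
	}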

> > How do
> > you implement it and why is it necessary?

As mentioned above, we depend on SEAMCALL to fail.

I am not sure whether it is necessary, but I thought it could help people to
review.

> > 
> > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > the "TDX-capable" nodes.
> > 
> > Why does it need to be enhanced?  What's the problem?

The problem is after TDX module initialization, no more memory can be hot-added
to the page allocator.

Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
utilize all memory.

But probably it is not necessary to call this out in the changelog?

> > 
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index dd333b46fafb..b36129183035 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> > > >  	depends on X86_64
> > > >  	depends on KVM_INTEL
> > > >  	depends on X86_X2APIC
> > > > +	select ARCH_KEEP_MEMBLOCK
> > > >  	help
> > > >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > > >  	  host and certain physical attacks.  This option enables necessary TDX
> > > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > > > index 216fee7144ee..3a841a77fda4 100644
> > > > --- a/arch/x86/kernel/setup.c
> > > > +++ b/arch/x86/kernel/setup.c
> > > > @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
> > > >  	 *
> > > >  	 * Moreover, on machines with SandyBridge graphics or in setups that use
> > > >  	 * crashkernel the entire 1M is reserved anyway.
> > > > +	 *
> > > > +	 * Note the host kernel TDX also requires the first 1MB to be reserved.
> > > >  	 */
> > > >  	reserve_real_mode();
> > > >  
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index 6fe505c32599..f010402f443d 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -13,6 +13,13 @@
> > > >  #include <linux/errno.h>
> > > >  #include <linux/printk.h>
> > > >  #include <linux/mutex.h>
> > > > +#include <linux/list.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/memblock.h>
> > > > +#include <linux/memory.h>
> > > > +#include <linux/minmax.h>
> > > > +#include <linux/sizes.h>
> > > > +#include <linux/pfn.h>
> > > >  #include <asm/pgtable_types.h>
> > > >  #include <asm/msr.h>
> > > >  #include <asm/tdx.h>
> > > > @@ -25,6 +32,12 @@ enum tdx_module_status_t {
> > > >  	TDX_MODULE_ERROR
> > > >  };
> > > >  
> > > > +struct tdx_memblock {
> > > > +	struct list_head list;
> > > > +	unsigned long start_pfn;
> > > > +	unsigned long end_pfn;
> > > > +};
> > > > +
> > > >  static u32 tdx_keyid_start __ro_after_init;
> > > >  static u32 nr_tdx_keyids __ro_after_init;
> > > >  
> > > > @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
> > > >  /* Prevent concurrent attempts on TDX detection and initialization */
> > > >  static DEFINE_MUTEX(tdx_module_lock);
> > > >  
> > > > +/* All TDX-usable memory regions */
> > > > +static LIST_HEAD(tdx_memlist);
> > > > +
> > > >  /*
> > > >   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
> > > >   * This is used in TDX initialization error paths to take it from
> > > > @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> > > > +{
> > > > +	struct tdx_memblock *tmb;
> > > > +
> > > > +	/* Empty list means TDX isn't enabled. */
> > > > +	if (list_empty(&tdx_memlist))
> > > > +		return true;
> > > > +
> > > > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > +		/*
> > > > +		 * The new range is TDX memory if it is fully covered by
> > > > +		 * any TDX memory block.
> > > > +		 *
> > > > +		 * Note TDX memory blocks are originated from memblock
> > > > +		 * memory regions, which can only be contiguous when two
> > > > +		 * regions have different NUMA nodes or flags.  Therefore
> > > > +		 * the new range cannot cross multiple TDX memory blocks.
> > > > +		 */
> > > > +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > +			return true;
> > > > +	}
> > > > +	return false;
> > > > +}
> > 
> > I don't really like that comment.  It should first state its behavior
> > and assumptions, like:
> > 
> > 	This check assumes that the start_pfn<->end_pfn range does not
> > 	cross multiple tdx_memlist entries.
> > 
> > Only then should it describe why that is OK:
> > 
> > 	A single memory hotplug event across multiple memblocks (from
> > 	which tdx_memlist entries are derived) is impossible.  ... then
> > 	actually explain
> > 

How about below?

	/*
	 * This check assumes that the start_pfn<->end_pfn range does not cross
	 * multiple tdx_memlist entries. A single memory hotplug event across
	 * multiple memblocks (from which tdx_memlist entries are derived) is
	 * impossible. That means start_pfn<->end_pfn range cannot exceed a
	 * tdx_memlist entry, and the new range is TDX memory if it is fully
	 * covered by any tdx_memlist entry.
	 */

> > 
> > 
> > > > +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> > > > +			       void *v)
> > > > +{
> > > > +	struct memory_notify *mn = v;
> > > > +
> > > > +	if (action != MEM_GOING_ONLINE)
> > > > +		return NOTIFY_OK;
> > > > +
> > > > +	/*
> > > > +	 * Not all memory is compatible with TDX.  Reject
> > > > +	 * to online any incompatible memory.
> > > > +	 */
> > 
> > This comment isn't quite right either.  There might actually be totally
> > TDX *compatible* memory here.  It just wasn't configured for use with TDX.
> > 
> > Shouldn't this be something more like:
> > 
> > 	/*
> > 	 * The TDX memory configuration is static and can not be
> > 	 * changed.  Reject onlining any memory which is outside
> > 	 * of the static configuration whether it supports TDX or not.
> > 	 */

Yes it's better. Thanks.

My intention was that the "incompatible memory" in the original comment actually
means "out-of-static-configuration" memory, but indeed it can be easily confused
with non-CMR memory.

> > 
> > > > +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> > > > +		NOTIFY_OK : NOTIFY_BAD;
> > > > +}
> > > > +
> > > > +static struct notifier_block tdx_memory_nb = {
> > > > +	.notifier_call = tdx_memory_notifier,
> > > > +};
> > > > +
> > > >  static int __init tdx_init(void)
> > > >  {
> > > >  	int err;
> > > > @@ -89,6 +149,13 @@ static int __init tdx_init(void)
> > > >  		goto no_tdx;
> > > >  	}
> > > >  
> > > > +	err = register_memory_notifier(&tdx_memory_nb);
> > > > +	if (err) {
> > > > +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> > > > +				err);
> > > > +		goto no_tdx;
> > > > +	}
> > > > +
> > > >  	return 0;
> > > >  no_tdx:
> > > >  	clear_tdx();
> > > > @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Add a memory region as a TDX memory block.  The caller must make sure
> > > > + * all memory regions are added in address ascending order and don't
> > > > + * overlap.
> > > > + */
> > > > +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> > > > +			    unsigned long end_pfn)
> > > > +{
> > > > +	struct tdx_memblock *tmb;
> > > > +
> > > > +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > > > +	if (!tmb)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	INIT_LIST_HEAD(&tmb->list);
> > > > +	tmb->start_pfn = start_pfn;
> > > > +	tmb->end_pfn = end_pfn;
> > > > +
> > > > +	list_add_tail(&tmb->list, tmb_list);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void free_tdx_memlist(struct list_head *tmb_list)
> > > > +{
> > > > +	while (!list_empty(tmb_list)) {
> > > > +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> > > > +				struct tdx_memblock, list);
> > > > +
> > > > +		list_del(&tmb->list);
> > > > +		kfree(tmb);
> > > > +	}
> > > > +}
> > 
> > 'tdx_memlist' is written only once at boot and then is read-only, right?
> > 
> > It might be nice to mention that so that the lack of locking doesn't
> > look problematic.

No. 'tdx_memlist' is only written in init_tdx_module(), and it is only read in
memory hotplug. The read/write of 'tdx_memlist' is protected by memory hotplug
locking as you also mentioned below.

> > 
> > > > +/*
> > > > + * Ensure that all memblock memory regions are convertible to TDX
> > > > + * memory.  Once this has been established, stash the memblock
> > > > + * ranges off in a secondary structure because memblock is modified
> > > > + * in memory hotplug while TDX memory regions are fixed.
> > > > + */
> > 
> > Ahh, that's why we need to "shadow" the memblocks.  Can you add a
> > sentence on this to the changelog, please?

Yes will do.

> > 
> > > > +static int build_tdx_memlist(struct list_head *tmb_list)
> > > > +{
> > > > +	unsigned long start_pfn, end_pfn;
> > > > +	int i, ret;
> > > > +
> > > > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> > > > +		/*
> > > > +		 * The first 1MB is not reported as TDX convertible memory.
> > > > +		 * Although the first 1MB is always reserved and won't end up
> > > > +		 * in the page allocator, it is still in memblock's memory
> > > > +		 * regions.  Skip them manually to exclude them as TDX memory.
> > > > +		 */
> > > > +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> > > > +		if (start_pfn >= end_pfn)
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * Add the memory regions as TDX memory.  The regions in
> > > > +		 * memblock are already guaranteed to be in address
> > > > +		 * ascending order and don't overlap.
> > > > +		 */
> > > > +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> > > > +		if (ret)
> > > > +			goto err;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +err:
> > > > +	free_tdx_memlist(tmb_list);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static int init_tdx_module(void)
> > > >  {
> > > >  	/*
> > > > @@ -226,10 +364,25 @@ static int init_tdx_module(void)
> > > >  	if (ret)
> > > >  		goto out;
> > > >  
> > > > +	/*
> > > > +	 * The initial support of TDX guests only allocates memory from
> > > > +	 * the global page allocator.  To keep things simple, just make
> > > > +	 * sure all pages in the page allocator are TDX memory.
> > 
> > I didn't like this in the changelog either.  Try to make this "timeless"
> > rather than refer to what the support is today.  I gave you example text
> > above.

Yes will improve.

> > 
> > > > +	 * Build the list of "TDX-usable" memory regions which cover all
> > > > +	 * pages in the page allocator to guarantee that.  Do it while
> > > > +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> > > > +	 * path reads the @tdx_memlist to reject any new memory.
> > > > +	 */
> > > > +	get_online_mems();
> > 
> > Oh, it actually uses the memory hotplug locking for list protection.
> > That's at least a bit subtle.  Please document that somewhere in the
> > functions that actually manipulate the list.

add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
list_del() to manipulate the list, but they actually take 'struct list_head
*tmb_list' as argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.

Do you mean document the locking around the implementation of add_tdx_memblock()
and free_tdx_memlist()?

Or should we just mention it around the 'tdx_memlist' variable?

/* All TDX-usable memory regions. Protected by memory hotplug locking. */
static LIST_HEAD(tdx_memlist);

> > 
> > I think it's also worth saying something here about the high-level
> > effects of what's going on:
> > 
> > 	Take a snapshot of the memory configuration (memblocks).  This
> > 	snapshot will be used to enable TDX support for *this* memory
> > 	configuration only.  Use a memory hotplug notifier to ensure
> > 	that no other RAM can be added outside of this configuration.
> > 
> > That's it, right?

Yes. I'll somehow include above into the comment around get_online_mems().

But should I move "Use a memory hotplug notifier ..." part to:

	err = register_memory_notifier(&tdx_memory_nb);

because this is where we actually use the memory hotplug notifier?

> > 
> > > > +	ret = build_tdx_memlist(&tdx_memlist);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > >  	/*
> > > >  	 * TODO:
> > > >  	 *
> > > > -	 *  - Build the list of TDX-usable memory regions.
> > > >  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
> > > >  	 *    regions.
> > > >  	 *  - Pick up one TDX private KeyID as the global KeyID.
> > > > @@ -241,6 +394,11 @@ static int init_tdx_module(void)
> > > >  	 */
> > > >  	ret = -EINVAL;
> > > >  out:
> > > > +	/*
> > > > +	 * @tdx_memlist is written here and read at memory hotplug time.
> > > > +	 * Lock out memory hotplug code while building it.
> > > > +	 */
> > > > +	put_online_mems();
> > > >  	return ret;
> > > >  }
> > 
> > You would also be wise to have the folks who do a lot of memory hotplug
> > work today look at this sooner rather than later.  I _think_ what you
> > have here is OK, but I'm really rusty on the code itself.
> > 

Thanks for advice. Will do.
Dave Hansen Jan. 9, 2023, 4:51 p.m. UTC | #3
On 1/9/23 03:48, Huang, Kai wrote:
...
>>>>> This approach works as in practice all boot-time present DIMMs are TDX
>>>>> convertible memory.  However, if any non-TDX-convertible memory has been
>>>>> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
>>>>> module, the module initialization will fail.
>>>
>>> I really don't know what this is trying to say.
> 
> My intention is to explain and call out that such a design (use all memory regions
> in memblock at the time of module initialization) works in practice, as long as
> non-CMR memory hasn't been added via memory hotplug.
> 
> Not sure if it is necessary, but I was thinking it may help reviewers judge
> whether such a design is acceptable.

This is yet another case where you've mechanically described the "what",
but left out the implications or the underlying basis "why".

I'd take a more methodical approach to describe what is going on here.
List the steps that must occur, or at least *one* example of those steps
and how they interact with the code in this patch.  Then, explain the
fallout.

I also don't think it's quite right to call out "CXL memory via kmem
driver".  If the CXL memory was "System RAM", it should get covered by a
CMR and TDMR.  The kmem driver can still go wild with it.

>>> *How* and *why* does this module initialization failure occur?
> 
> If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> will fail.

I'm frankly lost now.  Please go back and try to explain this better.
Let me know if you want to iterate on this faster than resending this
series five more times.  I've got some ideas.

>>>>> This can also be enhanced in the future, i.e. by allowing adding non-TDX
>>>>> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>>>>> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>>>>> needs to guarantee memory pages for TDX guests are always allocated from
>>>>> the "TDX-capable" nodes.
>>>
>>> Why does it need to be enhanced?  What's the problem?
> 
> The problem is after TDX module initialization, no more memory can be hot-added
> to the page allocator.
> 
> Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> utilize all memory.
> 
> But probably it is not necessary to call this out in the changelog?

Let's say that we add this TDX-compatible-node ABI in the future.  What
will old code do that doesn't know about this ABI?

...
>>>>> +       list_for_each_entry(tmb, &tdx_memlist, list) {
>>>>> +               /*
>>>>> +                * The new range is TDX memory if it is fully covered by
>>>>> +                * any TDX memory block.
>>>>> +                *
>>>>> +                * Note TDX memory blocks are originated from memblock
>>>>> +                * memory regions, which can only be contiguous when two
>>>>> +                * regions have different NUMA nodes or flags.  Therefore
>>>>> +                * the new range cannot cross multiple TDX memory blocks.
>>>>> +                */
>>>>> +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
>>>>> +                       return true;
>>>>> +       }
>>>>> +       return false;
>>>>> +}
>>>
>>> I don't really like that comment.  It should first state its behavior
>>> and assumptions, like:
>>>
>>>     This check assumes that the start_pfn<->end_pfn range does not
>>>     cross multiple tdx_memlist entries.
>>>
>>> Only then should it describe why that is OK:
>>>
>>>     A single memory hotplug event across multiple memblocks (from
>>>     which tdx_memlist entries are derived) is impossible.  ... then
>>>     actually explain
>>>
> 
> How about below?
> 
>         /*
>          * This check assumes that the start_pfn<->end_pfn range does not cross
>          * multiple tdx_memlist entries. A single memory hotplug event across
>          * multiple memblocks (from which tdx_memlist entries are derived) is
>          * impossible. That means start_pfn<->end_pfn range cannot exceed a
>          * tdx_memlist entry, and the new range is TDX memory if it is fully
>          * covered by any tdx_memlist entry.
>          */

I was hoping you would actually explain why it is impossible.

Is there something fundamental that keeps a memory area that spans two
nodes from being removed and then a new area added that is comprised of
a single node?

Boot time:

	| memblock  |  memblock |
	<--Node=0--> <--Node=1-->

Funky hotplug... nothing to see here, then:

	<--------Node=2-------->

I would believe that there is no current bare-metal TDX system that has
an implementation like this.  But, the comments above speak like it's
fundamentally impossible.  That should be clarified.

In other words, that comment talks about memblock attributes as being
the core underlying reason that that simplified check is OK.  Is that
it, or is it really the reduced hotplug feature set on TDX systems?


...
>>>>> +        * Build the list of "TDX-usable" memory regions which cover all
>>>>> +        * pages in the page allocator to guarantee that.  Do it while
>>>>> +        * holding mem_hotplug_lock read-lock as the memory hotplug code
>>>>> +        * path reads the @tdx_memlist to reject any new memory.
>>>>> +        */
>>>>> +       get_online_mems();
>>>
>>> Oh, it actually uses the memory hotplug locking for list protection.
>>> That's at least a bit subtle.  Please document that somewhere in the
>>> functions that actually manipulate the list.
> 
> add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
> list_del() to manipulate the list, but they actually take 'struct list_head
> *tmb_list' as argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.
> 
> Do you mean document the locking around the implementation of add_tdx_memblock()
> and free_tdx_memlist()?
> 
> Or should we just mention it around the 'tdx_memlist' variable?
> 
> /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> static LIST_HEAD(tdx_memlist);

I don't think I'd hate it being in all three spots.  Also "protected by
memory hotplug locking" is pretty generic.  Please be more specific.

>>> I think it's also worth saying something here about the high-level
>>> effects of what's going on:
>>>
>>>     Take a snapshot of the memory configuration (memblocks).  This
>>>     snapshot will be used to enable TDX support for *this* memory
>>>     configuration only.  Use a memory hotplug notifier to ensure
>>>     that no other RAM can be added outside of this configuration.
>>>
>>> That's it, right?
> 
> Yes. I'll somehow include above into the comment around get_online_mems().
> 
> But should I move "Use a memory hotplug notifier ..." part to:
> 
>         err = register_memory_notifier(&tdx_memory_nb);
> 
> because this is where we actually use the memory hotplug notifier?

I actually want that snippet in the changelog.

>>>>> +       ret = build_tdx_memlist(&tdx_memlist);
>>>>> +       if (ret)
>>>>> +               goto out;
>>>>> +
>>>>>         /*
>>>>>          * TODO:
>>>>>          *
>>>>> -        *  - Build the list of TDX-usable memory regions.
>>>>>          *  - Construct a list of TDMRs to cover all TDX-usable memory
>>>>>          *    regions.
>>>>>          *  - Pick up one TDX private KeyID as the global KeyID.
>>>>> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>>>>>          */
>>>>>         ret = -EINVAL;
>>>>>  out:
>>>>> +       /*
>>>>> +        * @tdx_memlist is written here and read at memory hotplug time.
>>>>> +        * Lock out memory hotplug code while building it.
>>>>> +        */
>>>>> +       put_online_mems();
>>>>>         return ret;
>>>>>  }
>>>
>>> You would also be wise to have the folks who do a lot of memory hotplug
>>> work today look at this sooner rather than later.  I _think_ what you
>>> have here is OK, but I'm really rusty on the code itself.
>>>
> 
> Thanks for advice. Will do.
>
Kai Huang Jan. 10, 2023, 12:09 p.m. UTC | #4
On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> On 1/9/23 03:48, Huang, Kai wrote:
> ...
> > > > > > This approach works as in practice all boot-time present DIMMs are TDX
> > > > > > convertible memory.  However, if any non-TDX-convertible memory has been
> > > > > > hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> > > > > > module, the module initialization will fail.
> > > > 
> > > > I really don't know what this is trying to say.
> > 
> > My intention is to explain and call out that such design (use all memory regions
> > in memblock at the time of module initialization) works in practice, as long as
> > non-CMR memory hasn't been added via memory hotplug.
> > 
> > Not sure if it is necessary, but I was thinking it may help reviewer to judge
> > whether such design is acceptable.
> 
> This is yet another case where you've mechanically described the "what",
> but left out the implications or the underlying basis "why".
> 
> I'd take a more methodical approach to describe what is going on here.
> List the steps that must occur, or at least *one* example of those steps
> and how they interact with the code in this patch.  Then, explain the
> fallout.
> 
> I also don't think it's quite right to call out "CXL memory via kmem
> driver".  If the CXL memory was "System RAM", it should get covered by a
> CMR and TDMR.  The kmem driver can still go wild with it.
> 
> > > > *How* and *why* does this module initialization failure occur?
> > 
> > If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> > will fail.
> 
> I'm frankly lost now.  Please go back and try to explain this better.
> Let me know if you want to iterate on this faster than resending this
> series five more times.  I've got some ideas.

Let me try to do my work first.  Thanks.

> 
> > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > the "TDX-capable" nodes.
> > > > 
> > > > Why does it need to be enhanced?  What's the problem?
> > 
> > The problem is after TDX module initialization, no more memory can be hot-added
> > to the page allocator.
> > 
> > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > utilize all memory.
> > 
> > But probably it is not necessarily to call out in the changelog?
> 
> Let's say that we add this TDX-compatible-node ABI in the future.  What
> will old code do that doesn't know about this ABI?

Right.  The old app will break w/o knowing the new ABI.  One resolution, I
think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
TDX guest memory from those "TDX-capable" nodes.

Anyway, perhaps we can just delete this part from the changelog?

> 
> ...
> > > > > > +       list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > > > +               /*
> > > > > > +                * The new range is TDX memory if it is fully covered by
> > > > > > +                * any TDX memory block.
> > > > > > +                *
> > > > > > +                * Note TDX memory blocks are originated from memblock
> > > > > > +                * memory regions, which can only be contiguous when two
> > > > > > +                * regions have different NUMA nodes or flags.  Therefore
> > > > > > +                * the new range cannot cross multiple TDX memory blocks.
> > > > > > +                */
> > > > > > +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > > > +                       return true;
> > > > > > +       }
> > > > > > +       return false;
> > > > > > +}
> > > > 
> > > > I don't really like that comment.  It should first state its behavior
> > > > and assumptions, like:
> > > > 
> > > >     This check assumes that the start_pfn<->end_pfn range does not
> > > >     cross multiple tdx_memlist entries.
> > > > 
> > > > Only then should it describe why that is OK:
> > > > 
> > > >     A single memory hotplug event across multiple memblocks (from
> > > >     which tdx_memlist entries are derived) is impossible.  ... then
> > > >     actually explain
> > > > 
> > 
> > How about below?
> > 
> >         /*
> >          * This check assumes that the start_pfn<->end_pfn range does not cross
> >          * multiple tdx_memlist entries. A single memory hotplug event across
> >          * multiple memblocks (from which tdx_memlist entries are derived) is
> >          * impossible. That means start_pfn<->end_pfn range cannot exceed a
> >          * tdx_memlist entry, and the new range is TDX memory if it is fully
> >          * covered by any tdx_memlist entry.
> >          */
> 
> I was hoping you would actually explain why it is impossible.
> 
> Is there something fundamental that keeps a memory area that spans two
> nodes from being removed and then a new area added that is comprised of
> a single node?
> Boot time:
> 
> 	| memblock  |  memblock |
> 	<--Node=0--> <--Node=1-->
> 
> Funky hotplug... nothing to see here, then:
> 
> 	<--------Node=2-------->

I must have missed something, but how can this happen?

My memory was that this cannot happen because the BIOS always allocates address
ranges for all NUMA nodes during machine boot.  Those address ranges don't
necessarily need to have DIMMs fully populated, but they don't change during
the machine's runtime.

> 
> I would believe that there is no current bare-metal TDX system that has
> an implementation like this.  But, the comments above speak like it's
> fundamentally impossible.  That should be clarified.
> 
> In other words, that comment talks about memblock attributes as being
> the core underlying reason that that simplified check is OK.  Is that
> it, or is it really the reduced hotplug feature set on TDX systems?

Let me do more homework and get back to you.

> 
> 
> ...
> > > > > > +        * Build the list of "TDX-usable" memory regions which cover all
> > > > > > +        * pages in the page allocator to guarantee that.  Do it while
> > > > > > +        * holding mem_hotplug_lock read-lock as the memory hotplug code
> > > > > > +        * path reads the @tdx_memlist to reject any new memory.
> > > > > > +        */
> > > > > > +       get_online_mems();
> > > > 
> > > > Oh, it actually uses the memory hotplug locking for list protection.
> > > > That's at least a bit subtle.  Please document that somewhere in the
> > > > functions that actually manipulate the list.
> > 
> > add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
> > list_del() to manipulate the list, but they actually take 'struct list_head
> > *tmb_list' as argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.
> > 
> > Do you mean document the locking around the implementation of add_tdx_memblock()
> > and free_tdx_memlist()?
> > 
> > Or should we just mention it around the 'tdx_memlist' variable?
> > 
> > /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> > static LIST_HEAD(tdx_memlist);
> 
> I don't think I'd hate it being in all three spots.  Also "protected by
> memory hotplug locking" is pretty generic.  Please be more specific.

OK will do.

> 
> > > > I think it's also worth saying something here about the high-level
> > > > effects of what's going on:
> > > > 
> > > >     Take a snapshot of the memory configuration (memblocks).  This
> > > >     snapshot will be used to enable TDX support for *this* memory
> > > >     configuration only.  Use a memory hotplug notifier to ensure
> > > >     that no other RAM can be added outside of this configuration.
> > > > 
> > > > That's it, right?
> > 
> > Yes. I'll somehow include above into the comment around get_online_mems().
> > 
> > But should I move "Use a memory hotplug notifier ..." part to:
> > 
> >         err = register_memory_notifier(&tdx_memory_nb);
> > 
> > because this is where we actually use the memory hotplug notifier?
> 
> I actually want that snippet in the changelog.

Will do.

[snip]
Dave Hansen Jan. 10, 2023, 4:18 p.m. UTC | #5
On 1/10/23 04:09, Huang, Kai wrote:
> On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> On 1/9/23 03:48, Huang, Kai wrote:
>>>>>>> This can also be enhanced in the future, i.e. by allowing adding non-TDX
>>>>>>> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>>>>>>> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>>>>>>> needs to guarantee memory pages for TDX guests are always allocated from
>>>>>>> the "TDX-capable" nodes.
>>>>>
>>>>> Why does it need to be enhanced?  What's the problem?
>>>
>>> The problem is after TDX module initialization, no more memory can be hot-added
>>> to the page allocator.
>>>
>>> Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>>> actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>>> bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>>> utilize all memory.
>>>
>>> But probably it is not necessary to call this out in the changelog?
>>
>> Let's say that we add this TDX-compatible-node ABI in the future.  What
>> will old code do that doesn't know about this ABI?
> 
> Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> TDX-capable" nodes in the kernel, and let the kernel enforce always allocating
> TDX guest memory from those "TDX-capable" nodes.

That doesn't actually hide all of the behavior from users.  Let's say
they do:

	numactl --membind=6 qemu-kvm ...

In other words, take all of this guest's memory and put it on node 6.
There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
they make it a TDX guest:

	numactl --membind=6 qemu-kvm -tdx ...

What happens?  Does the kernel silently ignore the --membind=6?  Or does
it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
memory on node 6.

In other words, I don't think the kernel can just enforce this
internally and hide it from userspace.

>> Is there something fundamental that keeps a memory area that spans two
>> nodes from being removed and then a new area added that is comprised of
>> a single node?
>> Boot time:
>>
>> 	| memblock  |  memblock |
>> 	<--Node=0--> <--Node=1-->
>>
>> Funky hotplug... nothing to see here, then:
>>
>> 	<--------Node=2-------->
> 
> I must have missed something, but how can this happen?
> 
> My memory was that this cannot happen because the BIOS always allocates address
> ranges for all NUMA nodes during machine boot.  Those address ranges don't
> necessarily need to have DIMMs fully populated, but they don't change during
> the machine's runtime.

Is your memory correct?  Is there evidence, or requirements in any
specification to support your memory?
Kai Huang Jan. 11, 2023, 10 a.m. UTC | #6
On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> On 1/10/23 04:09, Huang, Kai wrote:
> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > the "TDX-capable" nodes.
> > > > > > 
> > > > > > Why does it need to be enhanced?  What's the problem?
> > > > 
> > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > to the page allocator.
> > > > 
> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > utilize all memory.
> > > > 
> > > > But probably it is not necessary to call this out in the changelog?
> > > 
> > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > will old code do that doesn't know about this ABI?
> > 
> > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > TDX-capable" nodes in the kernel, and let the kernel enforce always allocating
> > TDX guest memory from those "TDX-capable" nodes.
> 
> That doesn't actually hide all of the behavior from users.  Let's say
> they do:
> 
> 	numactl --membind=6 qemu-kvm ...
> 
> In other words, take all of this guest's memory and put it on node 6.
> There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> they make it a TDX guest:
> 
> 	numactl --membind=6 qemu-kvm -tdx ...
> 
> What happens?  Does the kernel silently ignore the --membind=6?  Or does
> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> memory on node 6.
> 
> In other words, I don't think the kernel can just enforce this
> internally and hide it from userspace.

IIUC, the kernel, for instance KVM, which knows the 'task_struct' is a TDX
guest, can manually AND the "TDX-capable" node mask into the task's mempolicy,
so that memory will always be allocated from those "TDX-capable" nodes.  KVM
can refuse to create the TDX guest if it finds the task's mempolicy doesn't
have any "TDX-capable" node, and print out a clear message to userspace.

But I am new to the core-mm, so I might have some misunderstanding.
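
To make the idea concrete, a rough sketch of the kind of check I mean is
below.  Everything here is hypothetical: 'tdx_capable_nodes' does not exist
in this series, and the real enforcement point would need more thought:

	#include <linux/cpuset.h>
	#include <linux/nodemask.h>
	#include <linux/printk.h>
	#include <linux/sched.h>

	/* Hypothetical mask of "TDX-capable" nodes; not part of this series. */
	static nodemask_t tdx_capable_nodes;

	/*
	 * Sketch: refuse to create a TDX guest if the task's allowed memory
	 * nodes contain no "TDX-capable" node, so the failure is explicit
	 * rather than a confusing -ENOMEM at allocation time.
	 */
	static int tdx_guest_check_nodes(void)
	{
		nodemask_t allowed = cpuset_mems_allowed(current);

		nodes_and(allowed, allowed, tdx_capable_nodes);
		if (nodes_empty(allowed)) {
			pr_err("TDX: task's mempolicy contains no TDX-capable node\n");
			return -EINVAL;
		}
		return 0;
	}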

> 
> > > Is there something fundamental that keeps a memory area that spans two
> > > nodes from being removed and then a new area added that is comprised of
> > > a single node?
> > > Boot time:
> > > 
> > > 	| memblock  |  memblock |
> > > 	<--Node=0--> <--Node=1-->
> > > 
> > > Funky hotplug... nothing to see here, then:
> > > 
> > > 	<--------Node=2-------->
> > 
> > I must have missed something, but how can this happen?
> > 
> > My memory was that this cannot happen because the BIOS always allocates address
> > ranges for all NUMA nodes during machine boot.  Those address ranges don't
> > necessarily need to have DIMMs fully populated, but they don't change during
> > the machine's runtime.
> 
> Is your memory correct?  Is there evidence, or requirements in any
> specification to support your memory?
> 

I tried to find whether there's any spec mentioning this, but so far I haven't
found any.  I'll ask around to see whether this case can happen.

In the meantime, I also spent some time looking into the memory hotplug code
more deeply.  Below is my thinking:

For a TDX system, AFAICT a non-buggy BIOS won't support physically hot-removing
CMR memory (thus no hot-add of CMR memory either).  So we are either talking
about hot-adding non-TDX-usable memory (memory that is not configured to the
TDX module), or a kernel soft offline -> (optional remove -> add ->) online of
TDX-usable memory.

For the former we don't need to care about whether the new range can cross
multiple tdx_memlist entries.  For the latter, the offline granularity is
'struct memory_block', which has a fixed size after boot IIUC.

And we can only offline one memory_block when it: 1) has no memory hole, and
2) has all pages in a single zone.  IIUC this means it's not possible to
offline two adjacent contiguous tdx_memlist entries and then online them
together as a single one.
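
If that reasoning holds, the single-entry assumption could even be made
explicit with a warning.  A sketch below (a hedged variant of
is_tdx_memory() in this patch, using its names, not something I am proposing
to merge as-is):

	static bool is_tdx_memory_strict(unsigned long start_pfn,
					 unsigned long end_pfn)
	{
		struct tdx_memblock *tmb;

		list_for_each_entry(tmb, &tdx_memlist, list) {
			if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
				return true;
			/*
			 * A partial overlap means the range crosses a
			 * tdx_memlist entry boundary, i.e. the single-entry
			 * assumption is broken.
			 */
			WARN_ON_ONCE(start_pfn < tmb->end_pfn &&
				     end_pfn > tmb->start_pfn);
		}
		return false;
	}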
Huang, Ying Jan. 12, 2023, 12:56 a.m. UTC | #7
"Huang, Kai" <kai.huang@intel.com> writes:

> On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> On 1/10/23 04:09, Huang, Kai wrote:
>> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > the "TDX-capable" nodes.
>> > > > > >
>> > > > > > Why does it need to be enhanced?  What's the problem?
>> > > >
>> > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > to the page allocator.
>> > > >
>> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > utilize all memory.
>> > > >
>> > > > But probably it is not necessary to call this out in the changelog?
>> > >
>> > > Let's say that we add this TDX-compatible-node ABI in the future.  What
>> > > will old code do that doesn't know about this ABI?
>> >
>> > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
>> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
>> > TDX-capable" nodes in the kernel, and let the kernel enforce always allocating
>> > TDX guest memory from those "TDX-capable" nodes.
>>
>> That doesn't actually hide all of the behavior from users.  Let's say
>> they do:
>>
>>       numactl --membind=6 qemu-kvm ...
>>
>> In other words, take all of this guest's memory and put it on node 6.
>> There lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
>> they make it a TDX guest:
>>
>>       numactl --membind=6 qemu-kvm -tdx ...
>>
>> What happens?  Does the kernel silently ignore the --membind=6?  Or does
>> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> memory on node 6.
>>
>> In other words, I don't think the kernel can just enforce this
>> internally and hide it from userspace.
>
> IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> refuse to create the TDX guest if it found task's mempolicy doesn't have any
> "TDX-capable" node, and print out a clear message to the userspace.
>
> But I am new to the core-mm, so I might have some misunderstanding.

Does KVM here mean the in-kernel KVM module?  If so, KVM can only output
a message in dmesg, which isn't very easy for users to digest.  It's
better for the user-space QEMU to detect whether the current configuration
is usable and report that to users, via GUI, syslog, etc.

Best Regards,
Huang, Ying
Kai Huang Jan. 12, 2023, 1:18 a.m. UTC | #8
On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
> "Huang, Kai" <kai.huang@intel.com> writes:
> 
> > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> > > On 1/10/23 04:09, Huang, Kai wrote:
> > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > > > the "TDX-capable" nodes.
> > > > > > > > 
> > > > > > > > Why does it need to be enhanced?  What's the problem?
> > > > > > 
> > > > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > > > to the page allocator.
> > > > > > 
> > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > > > utilize all memory.
> > > > > > 
> > > > > > But probably it is not necessarily to call out in the changelog?
> > > > > 
> > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > > > will old code do that doesn't know about this ABI?
> > > > 
> > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
> > > > TDX guest memory from those "TDX-capable" nodes.
> > > 
> > > That doesn't actually hide all of the behavior from users.  Let's say
> > > they do:
> > > 
> > >       numactl --membind=6 qemu-kvm ...
> > > 
> > > In other words, take all of this guest's memory and put it on node 6.
> > > There's lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> > > they make it a TDX guest:
> > > 
> > >       numactl --membind=6 qemu-kvm -tdx ...
> > > 
> > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
> > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> > > memory on node 6.
> > > 
> > > In other words, I don't think the kernel can just enforce this
> > > internally and hide it from userspace.
> > 
> > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> > refuse to create the TDX guest if it found task's mempolicy doesn't have any
> > "TDX-capable" node, and print out a clear message to the userspace.
> > 
> > But I am new to the core-mm, so I might have some misunderstanding.
> 
> Does KVM here mean the in-kernel KVM module?  If so, KVM can only output
> a message in dmesg, which isn't very easy for users to digest.  It's
> better for the user-space QEMU to detect whether the current configuration
> is usable and report that to users, via GUI, syslog, etc.

I am not against this. For instance, maybe we can add some dedicated error code
and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
discuss this more when we have patches actually sent out to the community.
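
For reference, the idea above could look roughly like the sketch below.  This
is illustrative only: 'tdx_capable_nodes' and the enforcement point are
hypothetical, not existing KVM code; only the nodemask helpers are from
<linux/nodemask.h>:

#include <linux/nodemask.h>
#include <linux/errno.h>

static nodemask_t tdx_capable_nodes;	/* hypothetical "TDX-capable" node mask */

static int tdx_constrain_mempolicy(nodemask_t *task_nodes)
{
	nodemask_t allowed;

	/* AND the task's allowed nodes with the "TDX-capable" nodes. */
	nodes_and(allowed, *task_nodes, tdx_capable_nodes);

	/* Refuse TDX guest creation if no "TDX-capable" node remains. */
	if (nodes_empty(allowed))
		return -EINVAL;

	*task_nodes = allowed;
	return 0;
}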
Huang, Ying Jan. 12, 2023, 1:59 a.m. UTC | #9
"Huang, Kai" <kai.huang@intel.com> writes:

> On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
>> "Huang, Kai" <kai.huang@intel.com> writes:
>>
>> > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
>> > > On 1/10/23 04:09, Huang, Kai wrote:
>> > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
>> > > > > On 1/9/23 03:48, Huang, Kai wrote:
>> > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
>> > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
>> > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
>> > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
>> > > > > > > > > > the "TDX-capable" nodes.
>> > > > > > > >
>> > > > > > > > Why does it need to be enhanced?  What's the problem?
>> > > > > >
>> > > > > > The problem is after TDX module initialization, no more memory can be hot-added
>> > > > > > to the page allocator.
>> > > > > >
>> > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
>> > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
>> > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
>> > > > > > utilize all memory.
>> > > > > >
>> > > > > > But probably it is not necessarily to call out in the changelog?
>> > > > >
>> > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
>> > > > > will old code do that doesn't know about this ABI?
>> > > >
>> > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
>> > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
>> > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
>> > > > TDX guest memory from those "TDX-capable" nodes.
>> > >
>> > > That doesn't actually hide all of the behavior from users.  Let's say
>> > > they do:
>> > >
>> > >       numactl --membind=6 qemu-kvm ...
>> > >
>> > > In other words, take all of this guest's memory and put it on node 6.
>> > > There's lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
>> > > they make it a TDX guest:
>> > >
>> > >       numactl --membind=6 qemu-kvm -tdx ...
>> > >
>> > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
>> > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
>> > > memory on node 6.
>> > >
>> > > In other words, I don't think the kernel can just enforce this
>> > > internally and hide it from userspace.
>> >
>> > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
>> > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
>> > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
>> > refuse to create the TDX guest if it found task's mempolicy doesn't have any
>> > "TDX-capable" node, and print out a clear message to the userspace.
>> >
>> > But I am new to the core-mm, so I might have some misunderstanding.
>>
>> Does KVM here mean the in-kernel KVM module?  If so, KVM can only output
>> a message in dmesg, which isn't very easy for users to digest.  It's
>> better for the user-space QEMU to detect whether the current configuration
>> is usable and report that to users, via GUI, syslog, etc.
>
> I am not against this. For instance, maybe we can add some dedicated error code
> and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
> discuss this more when we have patches actually sent out to the community.

Error code is a kind of ABI too. :-)

Best Regards,
Huang, Ying
Kai Huang Jan. 12, 2023, 2:22 a.m. UTC | #10
On Thu, 2023-01-12 at 09:59 +0800, Huang, Ying wrote:
> "Huang, Kai" <kai.huang@intel.com> writes:
> 
> > On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
> > > "Huang, Kai" <kai.huang@intel.com> writes:
> > > 
> > > > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> > > > > On 1/10/23 04:09, Huang, Kai wrote:
> > > > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > > > > > memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > > > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > > > > > the "TDX-capable" nodes.
> > > > > > > > > > 
> > > > > > > > > > Why does it need to be enhanced?  What's the problem?
> > > > > > > > 
> > > > > > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > > > > > to the page allocator.
> > > > > > > > 
> > > > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > > > > > utilize all memory.
> > > > > > > > 
> > > > > > > > But probably it is not necessarily to call out in the changelog?
> > > > > > > 
> > > > > > > Let's say that we add this TDX-compatible-node ABI in the future.  What
> > > > > > > will old code do that doesn't know about this ABI?
> > > > > > 
> > > > > > Right.  The old app will break w/o knowing the new ABI.  One resolution, I
> > > > > > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non-
> > > > > > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating
> > > > > > TDX guest memory from those "TDX-capable" nodes.
> > > > > 
> > > > > That doesn't actually hide all of the behavior from users.  Let's say
> > > > > they do:
> > > > > 
> > > > >       numactl --membind=6 qemu-kvm ...
> > > > > 
> > > > > In other words, take all of this guest's memory and put it on node 6.
> > > > > There's lots of free memory on node 6 which is TDX-*IN*compatible.  Then,
> > > > > they make it a TDX guest:
> > > > > 
> > > > >       numactl --membind=6 qemu-kvm -tdx ...
> > > > > 
> > > > > What happens?  Does the kernel silently ignore the --membind=6?  Or does
> > > > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> > > > > memory on node 6.
> > > > > 
> > > > > In other words, I don't think the kernel can just enforce this
> > > > > internally and hide it from userspace.
> > > > 
> > > > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX
> > > > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that
> > > > the memory will always be allocated from those "TDX-capable" nodes.  KVM can
> > > > refuse to create the TDX guest if it found task's mempolicy doesn't have any
> > > > "TDX-capable" node, and print out a clear message to the userspace.
> > > > 
> > > > But I am new to the core-mm, so I might have some misunderstanding.
> > > 
> > > Does KVM here mean the in-kernel KVM module?  If so, KVM can only output
> > > a message in dmesg, which isn't very easy for users to digest.  It's
> > > better for the user-space QEMU to detect whether the current configuration
> > > is usable and report that to users, via GUI, syslog, etc.
> > 
> > I am not against this. For instance, maybe we can add some dedicated error code
> > and let KVM return it to Qemu, but I don't want to speak for KVM guys.  We can
> > discuss this more when we have patches actually sent out to the community.
> 
> Error code is a kind of ABI too. :-)
> 

Right.  I can bring this up in the KVM TDX support series (when the time is
right) to see whether the KVM folks want such an error code in the initial
support, or prefer to extend it in the future.  The worst case is that an old
Qemu may not recognize the new error code (while the error is still available
in dmesg) but can still do the right thing (stop running, etc).
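
As a sketch of that worst case, the userspace side could look like below.
Everything here is hypothetical -- the dedicated error value and the reporting
flow are made up for illustration, and no such KVM ABI exists at this point:

#include <stdio.h>

/* Assumed dedicated error value; purely illustrative, not a real errno. */
#define TDX_ERR_NO_CAPABLE_NODE	1001

static void report_tdx_vm_failure(int err)
{
	if (err == TDX_ERR_NO_CAPABLE_NODE) {
		/* New userspace: give the user an actionable message. */
		fprintf(stderr, "TDX guest creation failed: memory policy excludes all TDX-capable nodes\n");
	} else {
		/*
		 * Old userspace: unknown error code.  Fall back to a
		 * generic message and stop, as described above; the
		 * details remain available in dmesg.
		 */
		fprintf(stderr, "TDX guest creation failed: error %d\n", err);
	}
}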
Kai Huang Jan. 12, 2023, 11:33 a.m. UTC | #11
On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > > +       list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > > > +               /*
> > > > > > +                * The new range is TDX memory if it is fully covered by
> > > > > > +                * any TDX memory block.
> > > > > > +                *
> > > > > > +                * Note TDX memory blocks are originated from memblock
> > > > > > +                * memory regions, which can only be contiguous when two
> > > > > > +                * regions have different NUMA nodes or flags.  Therefore
> > > > > > +                * the new range cannot cross multiple TDX memory blocks.
> > > > > > +                */
> > > > > > +               if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > > > +                       return true;
> > > > > > +       }
> > > > > > +       return false;
> > > > > > +}
> > > > 
> > > > I don't really like that comment.  It should first state its behavior
> > > > and assumptions, like:
> > > > 
> > > >      This check assumes that the start_pfn<->end_pfn range does not
> > > >      cross multiple tdx_memlist entries.
> > > > 
> > > > Only then should it describe why that is OK:
> > > > 
> > > >      A single memory hotplug event across multiple memblocks (from
> > > >      which tdx_memlist entries are derived) is impossible.  ... then
> > > >      actually explain
> > > > 
> > 
> > How about below?
> > 
> >          /*
> >           * This check assumes that the start_pfn<->end_pfn range does not
> >           * cross multiple tdx_memlist entries. A single memory hotplug
> >           * event across multiple memblocks (from which tdx_memlist entries
> >           * are derived) is impossible. That means start_pfn<->end_pfn range
> >           * cannot exceed a tdx_memlist entry, and the new range is TDX
> >           * memory if it is fully covered by any tdx_memlist entry.
> >           */
> 
> I was hoping you would actually explain why it is impossible.
> 
> Is there something fundamental that keeps a memory area that spans two
> nodes from being removed and then a new area added that is comprised of
> a single node?
> 
> Boot time:
> 
> 	| memblock  |  memblock |
> 	<--Node=0--> <--Node=1-->
> 
> Funky hotplug... nothing to see here, then:
> 
> 	<--------Node=2-------->
> 
> I would believe that there is no current bare-metal TDX system that has
> an implementation like this.  But, the comments above speak like it's
> fundamentally impossible.  That should be clarified.
> 
> In other words, that comment talks about memblock attributes as being
> the core underlying reason that that simplified check is OK.  Is that
> it, or is it really the reduced hotplug feature set on TDX systems?

Hi Dave,

I think I have been forgetting that we have switched to rejecting non-TDX
memory at memory online time, but not at memory hot-add time.

Memory offline/online is done at the granularity of 'struct memory_block', not
memblock.  In fact, the hotpluggable memory region (one memblock) must be a
multiple of the memory_block size, and a "to-be-online" memory_block must be a
full range of memory (no memory hole).

So if I am not missing something, IIUC that means if the start_pfn<->end_pfn
range is TDX memory, it must be fully within some @tdx_memlist entry and
cannot cross multiple small entries.  And the memory hotplug case in your
diagram above actually shouldn't matter.

If the above stands, how about below?

        /*
         * This check assumes that the start_pfn<->end_pfn range does not
         * cross multiple @tdx_memlist entries.  A single memory online
         * event across multiple @tdx_memlist entries (which are derived
         * from memblocks at the time of module initialization) is not
         * possible.
         *
         * This is because memory offline/online is done at the
         * granularity of 'struct memory_block', and the hotpluggable
         * memory region (one memblock) must be a multiple of the
         * memory_block size.  Also, the "to-be-online" memory_block
         * must be a full range of memory with no memory hole (i.e. it
         * cannot contain multiple small memblocks).
         *
         * This means if the start_pfn<->end_pfn range is TDX memory, it
         * must be fully within one @tdx_memlist entry and cannot cross
         * multiple small entries.
         */
        list_for_each_entry(tmb, &tdx_memlist, list) {
                if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
                        return true;
        }
        return false;
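
For completeness, the containment check above can also be exercised as a
standalone userspace sketch with made-up ranges (illustrative only, not
kernel code):

#include <stdbool.h>
#include <stdio.h>

struct range { unsigned long start_pfn, end_pfn; };

/* Two fictional TDX memory blocks, in ascending order, with a hole between. */
static const struct range tdx_ranges[] = {
	{ 0x100,   0x40000 },	/* 1MB..1GB in 4K pages */
	{ 0x80000, 0xc0000 },	/* 2GB..3GB */
};

static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
{
	for (size_t i = 0; i < sizeof(tdx_ranges) / sizeof(tdx_ranges[0]); i++) {
		if (start_pfn >= tdx_ranges[i].start_pfn &&
		    end_pfn <= tdx_ranges[i].end_pfn)
			return true;
	}
	return false;
}

int main(void)
{
	printf("%d\n", is_tdx_memory(0x200, 0x300));	/* 1: inside one range */
	printf("%d\n", is_tdx_memory(0x3ff00, 0x80100));	/* 0: straddles the hole */
	return 0;
}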
Kai Huang Jan. 18, 2023, 11:08 a.m. UTC | #12
+Dave, Oscar and Andrew.

Hi Memory hotplug maintainers,

Sorry for the CC, but could you help to review this Intel TDX (Trust Domain
Extensions) patch, since it is related to memory hotplug (though it does not
modify any common memory hotplug code directly)?  Dave suggested it's better
to get the memory hotplug folks to review sooner rather than later.

This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
your time.

On Fri, 2022-12-09 at 19:52 +1300, Kai Huang wrote:
> As a step of initializing the TDX module, the kernel needs to tell the
> TDX module which memory regions can be used by the TDX module as TDX
> guest memory.
> 
> TDX reports a list of "Convertible Memory Region" (CMR) to tell the
> kernel which memory is TDX compatible.  The kernel needs to build a list
> of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
> the TDX module.  Once this is done, those "TDX-usable" memory regions
> are fixed during module's lifetime.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, just make sure
> all pages in the page allocator are TDX memory.
> 
> To guarantee that, stash off the memblock memory regions at the time of
> initializing the TDX module as TDX's own usable memory regions, and in
> the meantime, register a TDX memory notifier to reject to online any new
> memory in memory hotplug.
> 
> This approach works as in practice all boot-time present DIMMs are TDX
> convertible memory.  However, if any non-TDX-convertible memory has been
> hot-added (i.e. CXL memory via kmem driver) before initializing the TDX
> module, the module initialization will fail.
> 
> This can also be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v7 -> v8:
>  - Trimed down changelog (Dave).
>  - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
>    (Ying).
>  - Moved memory hotplug handling from add_arch_memory() to
>    memory_notifier (Dan/David).
>  - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
>  - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave).
>  - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
>  - Improve the comment around first 1MB (Dave).
>  - Added a comment around reserve_real_mode() to point out TDX code
>    relies on first 1MB being reserved (Ying).
>  - Added comment to explain why the new online memory range cannot
>    cross multiple TDX memory blocks (Dave).
>  - Improved other comments (Dave).
> 
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/kernel/setup.c     |   2 +
>  arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 162 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 216fee7144ee..3a841a77fda4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
>  	 *
>  	 * Moreover, on machines with SandyBridge graphics or in setups that use
>  	 * crashkernel the entire 1M is reserved anyway.
> +	 *
> +	 * Note the host kernel TDX also requires the first 1MB being reserved.
>  	 */
>  	reserve_real_mode();
>  
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 6fe505c32599..f010402f443d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,13 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/mutex.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/memory.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
> +#include <linux/pfn.h>
>  #include <asm/pgtable_types.h>
>  #include <asm/msr.h>
>  #include <asm/tdx.h>
> @@ -25,6 +32,12 @@ enum tdx_module_status_t {
>  	TDX_MODULE_ERROR
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};
> +
>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 nr_tdx_keyids __ro_after_init;
>  
> @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
>  /* Prevent concurrent attempts on TDX detection and initialization */
>  static DEFINE_MUTEX(tdx_module_lock);
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
>  	return 0;
>  }
>  
> +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled. */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered by
> +		 * any TDX memory block.
> +		 *
> +		 * Note TDX memory blocks are originated from memblock
> +		 * memory regions, which can only be contiguous when two
> +		 * regions have different NUMA nodes or flags.  Therefore
> +		 * the new range cannot cross multiple TDX memory blocks.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * Not all memory is compatible with TDX.  Reject
> +	 * to online any incompatible memory.
> +	 */
> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +	.notifier_call = tdx_memory_notifier,
> +};
> +
>  static int __init tdx_init(void)
>  {
>  	int err;
> @@ -89,6 +149,13 @@ static int __init tdx_init(void)
>  		goto no_tdx;
>  	}
>  
> +	err = register_memory_notifier(&tdx_memory_nb);
> +	if (err) {
> +		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
> +				err);
> +		goto no_tdx;
> +	}
> +
>  	return 0;
>  no_tdx:
>  	clear_tdx();
> @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>  	return 0;
>  }
>  
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */
> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * to the page allocator, it is still in memblock's memory
> +		 * regions.  Skip them manually to exclude them as TDX memory.
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock has already guaranteed they are in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	/*
> @@ -226,10 +364,25 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, just make
> +	 * sure all pages in the page allocator are TDX memory.
> +	 *
> +	 * Build the list of "TDX-usable" memory regions which cover all
> +	 * pages in the page allocator to guarantee that.  Do it while
> +	 * holding mem_hotplug_lock read-lock as the memory hotplug code
> +	 * path reads the @tdx_memlist to reject any new memory.
> +	 */
> +	get_online_mems();
> +
> +	ret = build_tdx_memlist(&tdx_memlist);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Build the list of TDX-usable memory regions.
>  	 *  - Construct a list of TDMRs to cover all TDX-usable memory
>  	 *    regions.
>  	 *  - Pick up one TDX private KeyID as the global KeyID.
> @@ -241,6 +394,11 @@ static int init_tdx_module(void)
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
>  	return ret;
>  }
>
David Hildenbrand Jan. 18, 2023, 1:57 p.m. UTC | #13
On 18.01.23 12:08, Huang, Kai wrote:
> +Dave, Oscar and Andrew.
> 
> Hi Memory hotplug maintainers,
> 
> Sorry for the CC, but could you help to review this Intel TDX (Trust Domain
> Extensions) patch, since it is related to memory hotplug (though it does not
> modify any common memory hotplug code directly)?  Dave suggested it's better
> to get the memory hotplug folks to review sooner rather than later.
> 
> This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
> your time.

Hi,

I remember discussing that part (notifier) already and it looked good to 
me. No objection from my side.
Kai Huang Jan. 18, 2023, 7:38 p.m. UTC | #14
On Wed, 2023-01-18 at 14:57 +0100, David Hildenbrand wrote:
> On 18.01.23 12:08, Huang, Kai wrote:
> > +Dave, Oscar and Andrew.
> > 
> > Hi Memory hotplug maintainers,
> > 
> > Sorry for the CC, but could you help to review this Intel TDX (Trust Domain
> > Extensions) patch, since it is related to memory hotplug (though it does not
> > modify any common memory hotplug code directly)?  Dave suggested it's better
> > to get the memory hotplug folks to review sooner rather than later.
> > 
> > This whole series already has linux-mm@kvack.org in the CC list.  Thanks for
> > your time.
> 
> Hi,
> 
> I remember discussing that part (notifier) already and it looked good to 
> me. No objection from my side.
> 

Yes.  Thanks!
diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@  config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 216fee7144ee..3a841a77fda4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1174,6 +1174,8 @@  void __init setup_arch(char **cmdline_p)
 	 *
 	 * Moreover, on machines with SandyBridge graphics or in setups that use
 	 * crashkernel the entire 1M is reserved anyway.
+	 *
+	 * Note the host kernel TDX also requires the first 1MB being reserved.
 	 */
 	reserve_real_mode();
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6fe505c32599..f010402f443d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,13 @@ 
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/memory.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+#include <linux/pfn.h>
 #include <asm/pgtable_types.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -25,6 +32,12 @@  enum tdx_module_status_t {
 	TDX_MODULE_ERROR
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 nr_tdx_keyids __ro_after_init;
 
@@ -32,6 +45,9 @@  static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -69,6 +85,50 @@  static int __init record_keyid_partitioning(void)
 	return 0;
 }
 
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/* Empty list means TDX isn't enabled. */
+	if (list_empty(&tdx_memlist))
+		return true;
+
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		/*
+		 * The new range is TDX memory if it is fully covered by
+		 * any TDX memory block.
+		 *
+		 * Note TDX memory blocks are originated from memblock
+		 * memory regions, which can only be contiguous when two
+		 * regions have different NUMA nodes or flags.  Therefore
+		 * the new range cannot cross multiple TDX memory blocks.
+		 */
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
+
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
+			       void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Not all memory is compatible with TDX.  Reject
+	 * to online any incompatible memory.
+	 */
+	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
+		NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
+
 static int __init tdx_init(void)
 {
 	int err;
@@ -89,6 +149,13 @@  static int __init tdx_init(void)
 		goto no_tdx;
 	}
 
+	err = register_memory_notifier(&tdx_memory_nb);
+	if (err) {
+		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
+				err);
+		goto no_tdx;
+	}
+
 	return 0;
 no_tdx:
 	clear_tdx();
@@ -209,6 +276,77 @@  static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+/*
+ * Add a memory region as a TDX memory block.  The caller must make sure
+ * all memory regions are added in address ascending order and don't
+ * overlap.
+ */
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
+			    unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+
+	list_add_tail(&tmb->list, tmb_list);
+	return 0;
+}
+
+static void free_tdx_memlist(struct list_head *tmb_list)
+{
+	while (!list_empty(tmb_list)) {
+		struct tdx_memblock *tmb = list_first_entry(tmb_list,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Ensure that all memblock memory regions are convertible to TDX
+ * memory.  Once this has been established, stash the memblock
+ * ranges off in a secondary structure because memblock is modified
+ * in memory hotplug while TDX memory regions are fixed.
+ */
+static int build_tdx_memlist(struct list_head *tmb_list)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		/*
+		 * The first 1MB is not reported as TDX convertible memory.
+		 * Although the first 1MB is always reserved and won't end up
+		 * to the page allocator, it is still in memblock's memory
+		 * regions.  Skip them manually to exclude them as TDX memory.
+		 */
+		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/*
+		 * Add the memory regions as TDX memory.  The regions in
+		 * memblock has already guaranteed they are in address
+		 * ascending order and don't overlap.
+		 */
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memlist(tmb_list);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -226,10 +364,25 @@  static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * The initial support of TDX guests only allocates memory from
+	 * the global page allocator.  To keep things simple, just make
+	 * sure all pages in the page allocator are TDX memory.
+	 *
+	 * Build the list of "TDX-usable" memory regions which cover all
+	 * pages in the page allocator to guarantee that.  Do it while
+	 * holding mem_hotplug_lock read-lock as the memory hotplug code
+	 * path reads the @tdx_memlist to reject any new memory.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of TDMRs to cover all TDX-usable memory
 	 *    regions.
 	 *  - Pick up one TDX private KeyID as the global KeyID.
@@ -241,6 +394,11 @@  static int init_tdx_module(void)
 	 */
 	ret = -EINVAL;
 out:
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
+	put_online_mems();
 	return ret;
 }