diff mbox series

[v7,13/20] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

Message ID ef6cdab2c371b9f068f2b4bf493b1dd0c9bb3c99.1668988357.git.kai.huang@intel.com (mailing list archive)
State New
Headers show
Series TDX host kernel support | expand

Commit Message

Huang, Kai Nov. 21, 2022, 12:26 a.m. UTC
The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TD guest early during system boot to get those
PAMTs allocated at early time, but the only way to fix is to add a boot
option to allocate or reserve PAMTs during kernel boot.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Also dump out how many pages are allocated for PAMTs when the TDX module
is initialized successfully.

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - Changes due to using macros instead of 'enum' for TDX supported page
   sizes.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
 - Improved comment around tdmr_get_nid() (Dave).
 - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
   into PAMTs for 4K/2M/1G (Dave).
 - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).   

- v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed tdmr_get_pamt_sz() helper but use open-code instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.


---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 191 ++++++++++++++++++++++++++++++++++++
 2 files changed, 192 insertions(+)

Comments

Dave Hansen Nov. 23, 2022, 10:57 p.m. UTC | #1
On 11/20/22 16:26, Kai Huang wrote:
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory.  This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> up front.  They must be allocated by the kernel and then given to the
> TDX module.

... during module initialization.

> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
> 
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> mitigation is to launch a TD guest early during system boot to get those
> PAMTs allocated at early time, but the only way to fix is to add a boot
> option to allocate or reserve PAMTs during kernel boot.

FWIW, we all agree that this is a bad permanent way to leave things.
You can call me out here as proposing that this wart be left in place
while this series is merged and is a detail we can work on afterword
with new module params, boot options, Kconfig or whatever.

> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
> 
> Adopt the following policies when allocating PAMTs for a given TDMR:
> 
>   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>     the total number of reserved areas consumed for PAMTs.
>   - Try to first allocate PAMT from the local node of the TDMR for better
>     NUMA locality.
> 
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.

... this helps answer the eternal "where did all my memory go?" questions.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b36129183035..b86a333b860f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1960,6 +1960,7 @@ config INTEL_TDX_HOST
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
>  	select ARCH_KEEP_MEMBLOCK
> +	depends on CONTIG_ALLOC
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 57b448de59a0..9d76e70de46e 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -586,6 +586,187 @@ static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
>  	return 0;
>  }
>  
> +/*
> + * Calculate PAMT size given a TDMR and a page size.  The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
> +{
> +	unsigned long pamt_sz, nr_pamt_entries;
> +
> +	switch (pgsz) {
> +	case TDX_PS_4K:
> +		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> +		break;
> +	case TDX_PS_2M:
> +		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> +		break;
> +	case TDX_PS_1G:
> +		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}
> +
> +	pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size;
> +	/* TDX requires PAMT size must be 4K aligned */
> +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> +	return pamt_sz;
> +}
> +
> +/*
> + * Pick a NUMA node on which to allocate this TDMR's metadata.
> + *
> + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> + * not be.  If the TDMR covers more than one node, just use the _first_
> + * one.  This can lead to small areas of off-node metadata for some
> + * memory.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Find the first memory region covered by the TDMR */
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> +			return tmb->nid;
> +	}

Aha, the first use of tmb->nid!  I wondered why that was there.

> +
> +	/*
> +	 * Fall back to allocating the TDMR's metadata from node 0 when
> +	 * no TDX memory block can be found.  This should never happen
> +	 * since TDMRs originate from TDX memory blocks.
> +	 */
> +	WARN_ON_ONCE(1);

That's probably better a pr_warn() or something.  A backtrace and all
that jazz seems a bit overly dramatic for this.

> +	return 0;
> +}
The rest of this actually looks fine.  It's nearing ack'able state.
Huang, Kai Nov. 24, 2022, 11:46 a.m. UTC | #2
On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > The TDX module uses additional metadata to record things like which
> > guest "owns" a given page of memory.  This metadata, referred as
> > Physical Address Metadata Table (PAMT), essentially serves as the
> > 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> > up front.  They must be allocated by the kernel and then given to the
> > TDX module.
> 
> ... during module initialization.

Thanks.

> 
> > TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> > (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> > be a physically contiguous area from a Convertible Memory Region (CMR).
> > However, the PAMTs which track pages in one TDMR do not need to reside
> > within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> > any TDMR, the overlapping part must be reported as a reserved area in
> > that particular TDMR.
> > 
> > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > and it may be potentially large (~1/256th of the size of the given TDMR).
> > The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> > mitigation is to launch a TD guest early during system boot to get those
> > PAMTs allocated at early time, but the only way to fix is to add a boot
> > option to allocate or reserve PAMTs during kernel boot.
> 
> FWIW, we all agree that this is a bad permanent way to leave things.
> You can call me out here as proposing that this wart be left in place
> while this series is merged and is a detail we can work on afterword
> with new module params, boot options, Kconfig or whatever.

Sorry do you mean to call out in the cover letter, or in this changelog?

> > TDX only supports a limited number of reserved areas per TDMR to cover
> > both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> > allocated within a single TDMR, the reserved areas may not be sufficient
> > to cover all of them.
> > 
> > Adopt the following policies when allocating PAMTs for a given TDMR:
> > 
> >   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
> >     the total number of reserved areas consumed for PAMTs.
> >   - Try to first allocate PAMT from the local node of the TDMR for better
> >     NUMA locality.
> > 
> > Also dump out how many pages are allocated for PAMTs when the TDX module
> > is initialized successfully.
> 
> ... this helps answer the eternal "where did all my memory go?" questions.

Will add to the comment.

[...]


> > +/*
> > + * Pick a NUMA node on which to allocate this TDMR's metadata.
> > + *
> > + * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
> > + * not be.  If the TDMR covers more than one node, just use the _first_
> > + * one.  This can lead to small areas of off-node metadata for some
> > + * memory.
> > + */
> > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > +{
> > +	struct tdx_memblock *tmb;
> > +
> > +	/* Find the first memory region covered by the TDMR */
> > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > +		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> > +			return tmb->nid;
> > +	}
> 
> Aha, the first use of tmb->nid!  I wondered why that was there.

As you suggested I'll introduce the nid member of 'tdx_memblock' in this patch. 

> 
> > +
> > +	/*
> > +	 * Fall back to allocating the TDMR's metadata from node 0 when
> > +	 * no TDX memory block can be found.  This should never happen
> > +	 * since TDMRs originate from TDX memory blocks.
> > +	 */
> > +	WARN_ON_ONCE(1);
> 
> That's probably better a pr_warn() or something.  A backtrace and all
> that jazz seems a bit overly dramatic for this.

How about below?

pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT
allocation, fallback to use node 0.\n");
Dave Hansen Nov. 28, 2022, 4:39 p.m. UTC | #3
On 11/24/22 03:46, Huang, Kai wrote:
> On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
>> On 11/20/22 16:26, Kai Huang wrote:
>>> Use alloc_contig_pages() since PAMT must be a physically contiguous area
>>> and it may be potentially large (~1/256th of the size of the given TDMR).
>>> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
>>> mitigation is to launch a TD guest early during system boot to get those
>>> PAMTs allocated at early time, but the only way to fix is to add a boot
>>> option to allocate or reserve PAMTs during kernel boot.
>>
>> FWIW, we all agree that this is a bad permanent way to leave things.
>> You can call me out here as proposing that this wart be left in place
>> while this series is merged and is a detail we can work on afterword
>> with new module params, boot options, Kconfig or whatever.
> 
> Sorry do you mean to call out in the cover letter, or in this changelog?

Cover letter would be best.  But, a note in the changelog that it is
imperfect and will be improved on later would also be nice.

>>> +   /*
>>> +    * Fall back to allocating the TDMR's metadata from node 0 when
>>> +    * no TDX memory block can be found.  This should never happen
>>> +    * since TDMRs originate from TDX memory blocks.
>>> +    */
>>> +   WARN_ON_ONCE(1);
>>
>> That's probably better a pr_warn() or something.  A backtrace and all
>> that jazz seems a bit overly dramatic for this.
> 
> How about below?
> 
> pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT
> allocation, fallback to use node 0.\n");

I actually try to make these somewhat mirror the code.  For instance, if
you are searching using *just* the start TDMR address, then the message
should only talk about the start address.  Also, it's not trying to find
a *node* per se.  It's trying to find a 'tmb'.  So, if someone wanted to
debug this problem, they would actually want to dump out the tmbs.

But, back to the loop that this message describes:

> +	/* Find the first memory region covered by the TDMR */
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> +			return tmb->nid;
> +	}

That loop is funky.  It's not obvious at *all* why it even works.

1. A 'tmb' describes "real" memory.  It never covers holes.
2. This code is trying to find *a* 'tmb' to place a structure in.  It
   needs real memory to place this, of course.
3. A 'tdmr' may include holes and may not have a 'tmb' at either its
   start or end address
4. A 'tdmr' is expected to cover one or more 'tmb's.  If there were no
   'tmb's, then the TDMR is going to be marked as all reserved and is
   effectively being wasted.
5. A 'tdmr' may cover more than one NUMA node.  If this happens, it is
   ok to get memory from any of those nodes for that tdmr's PAMT.

I'd include this comment on the loop:

	A TDMR must cover at least part of one TMB.  That TMB will end
	after the TDMR begins.  But, that TMB may have started before
	the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
	begins.  Ignore 'tmb' start addresses.  They are irrelevant.

Maybe even a little ASCII diagram about the different tmb configurations
that this can find:

| TDMR1 | TDMR2 |
   |---tmb---|
          |tmb|
          |------tmb-------|
   |------tmb-------|

I'd also include this on the function:

/*
 * Locate a NUMA node which should hold the allocation of the @tdmr
 * PAMT.  This node will have some memory covered by the TDMR.  The
 * relative amount of memory covered is not considered.
 */
Huang, Kai Nov. 28, 2022, 10:48 p.m. UTC | #4
On Mon, 2022-11-28 at 08:39 -0800, Dave Hansen wrote:
> On 11/24/22 03:46, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 14:57 -0800, Dave Hansen wrote:
> > > On 11/20/22 16:26, Kai Huang wrote:
> > > > Use alloc_contig_pages() since PAMT must be a physically contiguous area
> > > > and it may be potentially large (~1/256th of the size of the given TDMR).
> > > > The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> > > > mitigation is to launch a TD guest early during system boot to get those
> > > > PAMTs allocated at early time, but the only way to fix is to add a boot
> > > > option to allocate or reserve PAMTs during kernel boot.
> > > 
> > > FWIW, we all agree that this is a bad permanent way to leave things.
> > > You can call me out here as proposing that this wart be left in place
> > > while this series is merged and is a detail we can work on afterword
> > > with new module params, boot options, Kconfig or whatever.
> > 
> > Sorry do you mean to call out in the cover letter, or in this changelog?
> 
> Cover letter would be best.  But, a note in the changelog that it is
> imperfect and will be improved on later would also be nice.

Thanks will do both.

> 
> > > > +   /*
> > > > +    * Fall back to allocating the TDMR's metadata from node 0 when
> > > > +    * no TDX memory block can be found.  This should never happen
> > > > +    * since TDMRs originate from TDX memory blocks.
> > > > +    */
> > > > +   WARN_ON_ONCE(1);
> > > 
> > > That's probably better a pr_warn() or something.  A backtrace and all
> > > that jazz seems a bit overly dramatic for this.
> > 
> > How about below?
> > 
> > pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT
> > allocation, fallback to use node 0.\n");
> 
> I actually try to make these somewhat mirror the code.  For instance, if
> you are searching using *just* the start TDMR address, then the message
> should only talk about the start address.  Also, it's not trying to find
> a *node* per se.  It's trying to find a 'tmb'.  So, if someone wanted to
> debug this problem, they would actually want to dump out the tmbs.
> 
> But, back to the loop that this message describes:
> 
> > +	/* Find the first memory region covered by the TDMR */
> > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > +		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
> > +			return tmb->nid;
> > +	}
> 
> That loop is funky.  It's not obvious at *all* why it even works.
> 
> 1. A 'tmb' describes "real" memory.  It never covers holes.
> 2. This code is trying to find *a* 'tmb' to place a structure in.  It
>    needs real memory to place this, of course.
> 3. A 'tdmr' may include holes and may not have a 'tmb' at either its
>    start or end address
> 4. A 'tdmr' is expected to cover one or more 'tmb's.  If there were no
>    'tmb's, then the TDMR is going to be marked as all reserved and is
>    effectively being wasted.
> 5. A 'tdmr' may cover more than one NUMA node.  If this happens, it is
>    ok to get memory from any of those nodes for that tdmr's PAMT.

Right.

> 
> I'd include this comment on the loop:
> 
> 	A TDMR must cover at least part of one TMB.  That TMB will end
> 	after the TDMR begins.  But, that TMB may have started before
> 	the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
> 	begins.  Ignore 'tmb' start addresses.  They are irrelevant.

Thanks.  Will do.

However, I am not sure I quite understand "the next 'tmb'" part?

> 
> Maybe even a little ASCII diagram about the different tmb configurations
> that this can find:
> 
> > TDMR1 | TDMR2 |
>    |---tmb---|			
>           |tmb|
>           |------tmb-------|		<- case 3)
>    |------tmb-------|			<- case 4

Thanks for the diagram!

But IIUC it seems the above case 3) and 4) are actually not possible, since when
one TDMR is created, it's end is always rounded up to the end of TMB it tries to
cover (the rounded-up end may cover the entire or only partial of other TMBs,
though).

			1G		2G
	TDMR1		|	TDMR2	|

	|--tmb1--|  |--tmb2--|  |-tmb3-|

	node 0		     |  node 1

> 
> I'd also include this on the function:
> 
> /*
>  * Locate a NUMA node which should hold the allocation of the @tdmr
>  * PAMT.  This node will have some memory covered by the TDMR.  The
>  * relative amount of memory covered is not considered.
>  */
> 
> 

Thanks.  Will do.
Dave Hansen Nov. 28, 2022, 10:56 p.m. UTC | #5
On 11/28/22 14:48, Huang, Kai wrote:
>> Maybe even a little ASCII diagram about the different tmb configurations
>> that this can find:
>>
>>> TDMR1 | TDMR2 |
>>    |---tmb---|
>>           |tmb|
>>           |------tmb-------|          <- case 3)
>>    |------tmb-------|                 <- case 4
> Thanks for the diagram!
> 
> But IIUC it seems the above case 3) and 4) are actually not possible, since when
> one TDMR is created, it's end is always rounded up to the end of TMB it tries to
> cover (the rounded-up end may cover the entire or only partial of other TMBs,
> though).

OK, but at the same time, we shouldn't *STRICTLY* specialize every
single little chunk of this code to be aware of every other tiny little
implementation detail.

Let's say tomorrow's code has lots of TDMRs left, but fills up one
TDMR's reserved areas and has to "split" it.  Want to bet on whether the
person that adds that patch will be able to find this code and fix it up?

Or, say that the TDMR creation algorithm changes and they're not done in
order of ascending physical address.

This code actually gets easier and more obvious if you ignore the other
details.
Huang, Kai Nov. 28, 2022, 11:14 p.m. UTC | #6
On Mon, 2022-11-28 at 14:56 -0800, Hansen, Dave wrote:
> On 11/28/22 14:48, Huang, Kai wrote:
> > > Maybe even a little ASCII diagram about the different tmb configurations
> > > that this can find:
> > > 
> > > > TDMR1 | TDMR2 |
> > >    |---tmb---|
> > >           |tmb|
> > >           |------tmb-------|          <- case 3)
> > >    |------tmb-------|                 <- case 4
> > Thanks for the diagram!
> > 
> > But IIUC it seems the above case 3) and 4) are actually not possible, since when
> > one TDMR is created, it's end is always rounded up to the end of TMB it tries to
> > cover (the rounded-up end may cover the entire or only partial of other TMBs,
> > though).
> 
> OK, but at the same time, we shouldn't *STRICTLY* specialize every
> single little chunk of this code to be aware of every other tiny little
> implementation detail.
> 
> Let's say tomorrow's code has lots of TDMRs left, but fills up one
> TDMR's reserved areas and has to "split" it.  Want to bet on whether the
> person that adds that patch will be able to find this code and fix it up?

Yeah good point.

> 
> Or, say that the TDMR creation algorithm changes and they're not done in
> order of ascending physical address.
> 
> This code actually gets easier and more obvious if you ignore the other
> details.

Agreed.  Thanks.
diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b36129183035..b86a333b860f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1960,6 +1960,7 @@  config INTEL_TDX_HOST
 	depends on KVM_INTEL
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 57b448de59a0..9d76e70de46e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -586,6 +586,187 @@  static int create_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
 	return 0;
 }
 
+/*
+ * Calculate PAMT size given a TDMR and a page size.  The returned
+ * PAMT size is always aligned up to 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz)
+{
+	unsigned long pamt_sz, nr_pamt_entries;
+
+	switch (pgsz) {
+	case TDX_PS_4K:
+		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+		break;
+	case TDX_PS_2M:
+		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
+		break;
+	case TDX_PS_1G:
+		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	pamt_sz = nr_pamt_entries * tdx_sysinfo.pamt_entry_size;
+	/* TDX requires PAMT size must be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/*
+ * Pick a NUMA node on which to allocate this TDMR's metadata.
+ *
+ * This is imprecise since TDMRs are 1G aligned and NUMA nodes might
+ * not be.  If the TDMR covers more than one node, just use the _first_
+ * one.  This can lead to small areas of off-node metadata for some
+ * memory.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+	struct tdx_memblock *tmb;
+
+	/* Find the first memory region covered by the TDMR */
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		if (tmb->end_pfn > (tdmr_start(tdmr) >> PAGE_SHIFT))
+			return tmb->nid;
+	}
+
+	/*
+	 * Fall back to allocating the TDMR's metadata from node 0 when
+	 * no TDX memory block can be found.  This should never happen
+	 * since TDMRs originate from TDX memory blocks.
+	 */
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_base[TDX_PS_1G + 1];
+	unsigned long pamt_size[TDX_PS_1G + 1];
+	unsigned long tdmr_pamt_base;
+	unsigned long tdmr_pamt_size;
+	struct page *pamt;
+	int pgsz, nid;
+
+	nid = tdmr_get_nid(tdmr);
+
+	/*
+	 * Calculate the PAMT size for each TDX supported page size
+	 * and the total PAMT size.
+	 */
+	tdmr_pamt_size = 0;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) {
+		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz);
+		tdmr_pamt_size += pamt_size[pgsz];
+	}
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+			nid, &node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/*
+	 * Break the contiguous allocation back up into the
+	 * individual PAMTs for each page size.
+	 */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) {
+		pamt_base[pgsz] = tdmr_pamt_base;
+		tdmr_pamt_base += pamt_size[pgsz];
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
+	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
+	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+	return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn,
+			  unsigned long *pamt_npages)
+{
+	unsigned long pamt_base, pamt_sz;
+
+	/*
+	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
+	 * should always point to the beginning of that allocation.
+	 */
+	pamt_base = tdmr->pamt_4k_base;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	*pamt_pfn = pamt_base >> PAGE_SHIFT;
+	*pamt_npages = pamt_sz >> PAGE_SHIFT;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_npages;
+
+	tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages);
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_npages)
+		return;
+
+	if (WARN_ON_ONCE(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_npages);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_free_pamt(tdmr_array_entry(tdmr_array, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info *tdmr_array, int tdmr_num)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < tdmr_num; i++) {
+		ret = tdmr_set_up_pamt(tdmr_array_entry(tdmr_array, i));
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	return ret;
+}
+
+static unsigned long tdmrs_count_pamt_pages(struct tdmr_info *tdmr_array,
+					  int tdmr_num)
+{
+	unsigned long pamt_npages = 0;
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		unsigned long pfn, npages;
+
+		tdmr_get_pamt(tdmr_array_entry(tdmr_array, i), &pfn, &npages);
+		pamt_npages += npages;
+	}
+
+	return pamt_npages;
+}
+
 /*
  * Construct an array of TDMRs to cover all TDX memory ranges.
  * The actual number of TDMRs is kept to @tdmr_num.
@@ -598,8 +779,13 @@  static int construct_tdmrs(struct tdmr_info *tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err;
 
+	ret = tdmrs_set_up_pamt_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err;
+
 	/* Return -EINVAL until constructing TDMRs is done */
 	ret = -EINVAL;
+	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
 err:
 	return ret;
 }
@@ -686,6 +872,11 @@  static int init_tdx_module(void)
 	 * process are done.
 	 */
 	ret = -EINVAL;
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	else
+		pr_info("%lu pages allocated for PAMT.\n",
+				tdmrs_count_pamt_pages(tdmr_array, tdmr_num));
 out_free_tdmrs:
 	/*
 	 * The array of TDMRs is freed no matter the initialization is