
[v9,07/18] x86/virt/tdx: Do TDX module per-cpu initialization

Message ID 557c526a1190903d11d67c4e2c76e01f67f6eb15.1676286526.git.kai.huang@intel.com (mailing list archive)
State New
Series TDX host kernel support

Commit Message

Huang, Kai Feb. 13, 2023, 11:59 a.m. UTC
After the SEAMCALL to do TDX module global initialization, a SEAMCALL to
do per-cpu initialization (TDH.SYS.LP.INIT) must be done on one logical
cpu before any other SEAMCALLs can be made on that cpu, including those
involved in the future steps of the module initialization.

To keep things simple, this implementation chooses to guarantee that all
online cpus are "TDX-runnable" (TDH.SYS.LP.INIT has been successfully
done on them).  If the kernel were to allow a cpu to be online while
TDH.SYS.LP.INIT had failed on it, the kernel would need to track a
cpumask of "TDX-runnable" cpus, know which tasks are "TDX workloads",
and guarantee that such tasks can only be scheduled to "TDX-runnable"
cpus.  For example, the kernel would need to reject sched_setaffinity()
if userspace tried to bind a TDX task to any "non-TDX-runnable" cpu.
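
For illustration, below is a minimal sketch of the kind of check the
kernel would otherwise need in the sched_setaffinity() path.  Note that
is_tdx_task() and tdx_runnable_mask are hypothetical names used only
for this example:

	/*
	 * Hypothetical affinity check if "non-TDX-runnable" cpus were
	 * allowed to stay online.  is_tdx_task() and tdx_runnable_mask
	 * do not exist; they are illustrative only.
	 */
	static int check_tdx_affinity(struct task_struct *p,
				      const struct cpumask *new_mask)
	{
		if (is_tdx_task(p) &&
		    !cpumask_subset(new_mask, &tdx_runnable_mask))
			return -EINVAL;

		return 0;
	}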

To guarantee all online cpus are "TDX-runnable", disable CPU hotplug
during module initialization and do TDH.SYS.LP.INIT on all online cpus
before any further steps of module initialization.  For CPU hotplug, do
TDH.SYS.LP.INIT in the CPU online callback when TDX has been enabled,
and refuse to online the cpu if the SEAMCALL fails.

Currently only KVM handles VMXON.  Similar to tdx_enable(), only provide
a new helper tdx_cpu_online(), and make KVM itself responsible for doing
VMXON and calling tdx_cpu_online() in its own CPU online callback.

Note tdx_enable() can be called multiple times by KVM because the KVM
module can be unloaded and reloaded.  New cpus may become online while
KVM is unloaded, and in this case TDH.SYS.LP.INIT won't be called for
those newly online cpus because KVM's CPU online callback is removed
when KVM is unloaded.  To make sure all online cpus are "TDX-runnable",
always do the per-cpu initialization for all online cpus in tdx_enable(),
even if the module has already been initialized.

Similar to the per-cpu module initialization, a later step to configure
the key for the global KeyID needs to call a SEAMCALL on one cpu for
each CPU package.  The difference is that the KeyID configuration
SEAMCALL cannot run in parallel on different cpus, while TDH.SYS.LP.INIT
can.  To avoid duplicated code, add a helper which calls a SEAMCALL on
all online cpus one by one, with a skip function to check whether to
skip certain cpus, and use that helper to do the per-cpu initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v8 -> v9:
 - Added this patch back.
 - Handled the relaxed new behaviour of TDH.SYS.LP.INIT

---
 arch/x86/include/asm/tdx.h  |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 210 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 3 files changed, 208 insertions(+), 5 deletions(-)

Comments

Dave Hansen Feb. 13, 2023, 5:59 p.m. UTC | #1
On 2/13/23 03:59, Kai Huang wrote:
> To avoid duplicated code, add a
> helper to call SEAMCALL on all online cpus one by one but with a skip
> function to check whether to skip certain cpus, and use that helper to
> do the per-cpu initialization.
...
> +/*
> + * Call @func on all online cpus one by one but skip those cpus
> + * when @skip_func is valid and returns true for them.
> + */
> +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
> +				bool (*skip_func)(int cpu, void *),
> +				void *skip_data)

I only see one caller of this.  Where is the duplicated code?
Dave Hansen Feb. 13, 2023, 6:07 p.m. UTC | #2
On 2/13/23 03:59, Kai Huang wrote:
> @@ -247,8 +395,17 @@ int tdx_enable(void)
>  		ret = __tdx_enable();
>  		break;
>  	case TDX_MODULE_INITIALIZED:
> -		/* Already initialized, great, tell the caller. */
> -		ret = 0;
> +		/*
> +		 * The previous call of __tdx_enable() may only have
> +		 * initialized part of present cpus during module
> +		 * initialization, and new cpus may have become online
> +		 * since then.
> +		 *
> +		 * To make sure all online cpus are TDX-runnable, always
> +		 * do per-cpu initialization for all online cpus here
> +		 * even the module has been initialized.
> +		 */
> +		ret = __tdx_enable_online_cpus();

I'm missing something here.  CPUs get initialized through either:

 1. __tdx_enable(), for the CPUs around at the time
 2. tdx_cpu_online(), for hotplugged CPUs after __tdx_enable()

But, this is a third class.  CPUs that came online after #1, but which
got missed by #2.  How can that happen?
Huang, Kai Feb. 13, 2023, 9:13 p.m. UTC | #3
> On 2/13/23 03:59, Kai Huang wrote:
> > @@ -247,8 +395,17 @@ int tdx_enable(void)
> >  		ret = __tdx_enable();
> >  		break;
> >  	case TDX_MODULE_INITIALIZED:
> > -		/* Already initialized, great, tell the caller. */
> > -		ret = 0;
> > +		/*
> > +		 * The previous call of __tdx_enable() may only have
> > +		 * initialized part of present cpus during module
> > +		 * initialization, and new cpus may have become online
> > +		 * since then.
> > +		 *
> > +		 * To make sure all online cpus are TDX-runnable, always
> > +		 * do per-cpu initialization for all online cpus here
> > +		 * even the module has been initialized.
> > +		 */
> > +		ret = __tdx_enable_online_cpus();
> 
> I'm missing something here.  CPUs get initialized through either:
> 
>  1. __tdx_enable(), for the CPUs around at the time
>  2. tdx_cpu_online(), for hotplugged CPUs after __tdx_enable()
> 
> But, this is a third class.  CPUs that came online after #1, but which got missed
> by #2.  How can that happen?

(Replying via Microsoft Outlook because my Evolution suddenly stopped working after updating Fedora.)

Currently we depend on KVM's CPU hotplug callback to call tdx_cpu_online().  The problem is that KVM's callback goes away when the KVM module gets unloaded.

For example:

	1) KVM module loaded when CPU 0, 1, 2 are online, CPU 3, 4, 5 are offline.
	2) __tdx_enable() gets called.  LP.INIT is done on CPU 0, 1, 2.
	3) KVM gets unloaded.  Its CPU hotplug callbacks are removed too.
	4) CPU 3 becomes online.  In this case, tdx_cpu_online() is not called for it as KVM's CPU hotplug callback is gone.

So later if KVM gets loaded again, we need to go through __tdx_enable_online_cpus() to do LP.INIT for CPU 3 as it's already online.

Perhaps I didn't explain clearly in the comment.  Below is the updated one:

		/*
		 * The previous call of __tdx_enable() may only have
		 * initialized part of present cpus during module
		 * initialization, and new cpus may have become online
		 * since then w/o doing per-cpu initialization.
		 * 
		 * For example, a new CPU can become online when KVM is
		 * unloaded, in which case tdx_cpu_enable() is not called since
		 * KVM's CPU online callback has been removed.
		 *
		 * To make sure all online cpus are TDX-runnable, always
		 * do per-cpu initialization for all online cpus here
		 * even the module has been initialized.
		 */
Huang, Kai Feb. 13, 2023, 9:19 p.m. UTC | #4
> On 2/13/23 03:59, Kai Huang wrote:
> > To avoid duplicated code, add a
> > helper to call SEAMCALL on all online cpus one by one but with a skip
> > function to check whether to skip certain cpus, and use that helper to
> > do the per-cpu initialization.
> ...
> > +/*
> > + * Call @func on all online cpus one by one but skip those cpus
> > + * when @skip_func is valid and returns true for them.
> > + */
> > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
> > +				bool (*skip_func)(int cpu, void *),
> > +				void *skip_data)
> 
> I only see one caller of this.  Where is the duplicated code?

The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages).

I kinda mentioned this in the changelog:

	" Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..."

If we don't have this helper, then we end up with the below loop in two functions:

	for_each_online(cpu) {
		if (should_skip(cpu))
			continue;

		// call @func on @cpu.
	}
Dave Hansen Feb. 13, 2023, 10:28 p.m. UTC | #5
On 2/13/23 13:13, Huang, Kai wrote:
> Perhaps I didn't explain clearly in the comment.  Below is the updated one:
> 
>                 /*
>                  * The previous call of __tdx_enable() may only have
>                  * initialized part of present cpus during module
>                  * initialization, and new cpus may have become online
>                  * since then w/o doing per-cpu initialization.
>                  *
>                  * For example, a new CPU can become online when KVM is
>                  * unloaded, in which case tdx_cpu_enable() is not called since
>                  * KVM's CPU online callback has been removed.
>                  *
>                  * To make sure all online cpus are TDX-runnable, always
>                  * do per-cpu initialization for all online cpus here
>                  * even the module has been initialized.
>                  */

This is voodoo.

I want a TDX-specific hotplug CPU handler.  Period.  Please make that
happen.  Put that code in this patch.  That handler should:

	1. Run after the KVM handler (if present)
	2. See if VMX is on
	3. If VMX is on:
	 3a. Run smp_func_module_lp_init(), else
	 3b. Mark the CPU as needing smp_func_module_lp_init()

Then, in the 'case TDX_MODULE_INITIALIZED:', you call a function to
iterate over the cpumask that was generated in 3b.

That makes the handoff *EXPLICIT*.  You know exactly which CPUs need
what done to them.  A CPU hotplug either explicitly involves doing the
work to make TDX work on the CPU, or explicitly defers the work to a
specific later time in a specific later piece of code.
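
A rough sketch of what such a handler could look like, reusing
smp_func_module_lp_init() from this patch; vmx_is_enabled() is a
hypothetical helper, and the dynamic online state is used here on the
assumption that it runs after KVM's online callback:

	static cpumask_t tdx_deferred_mask;

	static int tdx_online_cpu(unsigned int cpu)
	{
		if (vmx_is_enabled())			/* 2 + 3a */
			return smp_func_module_lp_init(NULL);

		cpumask_set_cpu(cpu, &tdx_deferred_mask);	/* 3b */
		return 0;
	}

	static int __init tdx_hp_init(void)
	{
		int ret;

		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "virt/tdx:online",
					tdx_online_cpu, NULL);
		/* CPUHP_AP_ONLINE_DYN returns the allocated state number */
		return ret < 0 ? ret : 0;
	}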
Dave Hansen Feb. 13, 2023, 10:43 p.m. UTC | #6
On 2/13/23 13:19, Huang, Kai wrote:
>> On 2/13/23 03:59, Kai Huang wrote:
>>> To avoid duplicated code, add a
>>> helper to call SEAMCALL on all online cpus one by one but with a skip
>>> function to check whether to skip certain cpus, and use that helper to
>>> do the per-cpu initialization.
>> ...
>>> +/*
>>> + * Call @func on all online cpus one by one but skip those cpus
>>> + * when @skip_func is valid and returns true for them.
>>> + */
>>> +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
>>> +                           bool (*skip_func)(int cpu, void *),
>>> +                           void *skip_data)
>> I only see one caller of this.  Where is the duplicated code?
> The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages).
> 
> I kinda mentioned this in the changelog:
> 
>         " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..."
> 
> If we don't have this helper, then we can end up with having below loop in two functions:
> 
>         for_each_online(cpu) {
>                 if (should_skip(cpu))
>                         continue;
> 
>                 // call @func on @cpu.
>         }

I don't think saving two lines of actual code is worth the opacity that
results from this abstraction.
Huang, Kai Feb. 13, 2023, 11:43 p.m. UTC | #7
On Mon, 2023-02-13 at 14:28 -0800, Dave Hansen wrote:
> On 2/13/23 13:13, Huang, Kai wrote:
> > Perhaps I didn't explain clearly in the comment.  Below is the updated one:
> > 
> >                 /*
> >                  * The previous call of __tdx_enable() may only have
> >                  * initialized part of present cpus during module
> >                  * initialization, and new cpus may have become online
> >                  * since then w/o doing per-cpu initialization.
> >                  *
> >                  * For example, a new CPU can become online when KVM is
> >                  * unloaded, in which case tdx_cpu_enable() is not called since
> >                  * KVM's CPU online callback has been removed.
> >                  *
> >                  * To make sure all online cpus are TDX-runnable, always
> >                  * do per-cpu initialization for all online cpus here
> >                  * even the module has been initialized.
> >                  */
> 
> This is voodoo.
> 
> I want a TDX-specific hotplug CPU handler.  Period.  Please make that
> happen.  
> 

Yes 100% agreed.

> Put that code in this patch.  That handler should:
> 
> 	1. Run after the KVM handler (if present)
> 	2. See if VMX is on
> 	3. If VMX is on:
> 	 3a. Run smp_func_module_lp_init(), else
> 	 3b. Mark the CPU as needing smp_func_module_lp_init()
> 
> Then, in the 'case TDX_MODULE_INITIALIZED:', you call a function to
> iterate over the cpumask that was generated in 3b.
> 
> That makes the handoff *EXPLICIT*.  You know exactly which CPUs need
> what done to them.  A CPU hotplug either explicitly involves doing the
> work to make TDX work on the CPU, or explicitly defers the work to a
> specific later time in a specific later piece of code.

In 3b we don't need to "explicitly mark the CPU as needing
smp_func_module_lp_init()".  We already have __cpu_tdx_mask to track whether
LP.INIT has been done on a cpu, and we can use that to determine:

	Any online cpu which isn't set in __cpu_tdx_mask needs to do LP.INIT in
	tdx_enable().

And the function module_lp_init_online_cpus() already handles that, and it can
be called directly in the tdx_enable() path (as shown in this patch).

I'll do the above as you suggested, but just use __cpu_tdx_mask as explained above.

(My main concern is that "Run after the KVM handler" seems a little bit hacky
to me.  Logically, it's more reasonable to have the TDX callback _before_
KVM's rather than _after_.  If any user (KVM) has done tdx_enable()
successfully, the TDX code should give the user a "TDX-runnable" cpu before
the user's (KVM's) own callback is invoked.  Anyway, as mentioned above, I'll
do it as you suggested.)
Dave Hansen Feb. 13, 2023, 11:52 p.m. UTC | #8
On 2/13/23 15:43, Huang, Kai wrote:
> ( My main concern is "Run after the KVM handler" seems a little bit hacky to me.
> Logically, it's more reasonable to have the TDX callback _before_ KVM's but not
> _after_.  If any user (KVM) has done tdx_enable() successfully, the TDX code
> should give the user a "TDX-runnable" cpu before user (KVM)'s own callback is
> involved. Anyway as mentioned above, I'll do above as you suggested.)

I was assuming that the KVM callback is what does VMXON for a given
logical CPU.  If that were the case, you'd need to do the TDX stuff
*AFTER* VMXON.

Am I wrong?
Huang, Kai Feb. 14, 2023, 12:02 a.m. UTC | #9
On Mon, 2023-02-13 at 14:43 -0800, Dave Hansen wrote:
> On 2/13/23 13:19, Huang, Kai wrote:
> > > On 2/13/23 03:59, Kai Huang wrote:
> > > > To avoid duplicated code, add a
> > > > helper to call SEAMCALL on all online cpus one by one but with a skip
> > > > function to check whether to skip certain cpus, and use that helper to
> > > > do the per-cpu initialization.
> > > ...
> > > > +/*
> > > > + * Call @func on all online cpus one by one but skip those cpus
> > > > + * when @skip_func is valid and returns true for them.
> > > > + */
> > > > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
> > > > +                           bool (*skip_func)(int cpu, void *),
> > > > +                           void *skip_data)
> > > I only see one caller of this.  Where is the duplicated code?
> > The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages).
> > 
> > I kinda mentioned this in the changelog:
> > 
> >         " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..."
> > 
> > If we don't have this helper, then we can end up with having below loop in two functions:
> > 
> >         for_each_online(cpu) {
> >                 if (should_skip(cpu))
> >                         continue;
> > 
> >                 // call @func on @cpu.
> >         }
> 
> I don't think saving two lines of actual code is worth the opacity that
> results from this abstraction.

Alright, thanks for the suggestion.  I'll remove this tdx_on_each_cpu_cond()
and do it directly.

But just checking:

LP.INIT can actually be called in parallel on different cpus (it doesn't have
to be, of course), so we could just use on_each_cpu_cond() for LP.INIT:

	on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true);

But IIUC Peter doesn't like using IPIs and prefers going via work:

https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/

So I used smp_call_on_cpu() here, which only calls @func on one cpu, not on a
cpumask.  For LP.INIT, ideally we could have something like:

	schedule_on_cpu(struct cpumask *cpus, work_func_t func);

to call @func on a cpu set, but that doesn't exist now, and I don't think it's
worth introducing?

So, should I use on_each_cpu_cond(), or use smp_call_on_cpu() here?

(btw for TDH.SYS.KEY.CONFIG we must do it one by one as it cannot run in
parallel on multiple cpus, so I'll use smp_call_on_cpu() for that.)
Huang, Kai Feb. 14, 2023, 12:09 a.m. UTC | #10
On Mon, 2023-02-13 at 15:52 -0800, Dave Hansen wrote:
> On 2/13/23 15:43, Huang, Kai wrote:
> > ( My main concern is "Run after the KVM handler" seems a little bit hacky to me.
> > Logically, it's more reasonable to have the TDX callback _before_ KVM's but not
> > _after_.  If any user (KVM) has done tdx_enable() successfully, the TDX code
> > should give the user a "TDX-runnable" cpu before user (KVM)'s own callback is
> > involved. Anyway as mentioned above, I'll do above as you suggested.)
> 
> I was assuming that the KVM callback is what does VMXON for a given
> logical CPU.  If that were the case, you'd need to do the TDX stuff
> *AFTER* VMXON.
> 
> Am I wrong?
> 
> 

You are right.

What I meant was: because we choose not to support VMXON in the (non-KVM)
kernel, we have to put TDX's callback after KVM's.  Otherwise, perhaps a
better way would be to put TDX's callback before KVM's.  But maybe it's an
arguable "perhaps", so let's just do TDX's callback after KVM's as you suggested.
Peter Zijlstra Feb. 14, 2023, 12:59 p.m. UTC | #11
On Tue, Feb 14, 2023 at 12:59:14AM +1300, Kai Huang wrote:
> +/*
> + * Call @func on all online cpus one by one but skip those cpus
> + * when @skip_func is valid and returns true for them.
> + */
> +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
> +				bool (*skip_func)(int cpu, void *),
> +				void *skip_data)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		int ret;
> +
> +		if (skip_func && skip_func(cpu, skip_data))
> +			continue;
> +
> +		/*
> +		 * SEAMCALL can be time consuming.  Call the @func on
> +		 * remote cpu via smp_call_on_cpu() instead of
> +		 * smp_call_function_single() to avoid busy waiting.
> +		 */
> +		ret = smp_call_on_cpu(cpu, func, func_data, true);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}

schedule_on_each_cpu() ?
Peter Zijlstra Feb. 14, 2023, 2:12 p.m. UTC | #12
On Tue, Feb 14, 2023 at 12:02:22AM +0000, Huang, Kai wrote:
> On Mon, 2023-02-13 at 14:43 -0800, Dave Hansen wrote:
> > On 2/13/23 13:19, Huang, Kai wrote:
> > > > On 2/13/23 03:59, Kai Huang wrote:
> > > > > To avoid duplicated code, add a
> > > > > helper to call SEAMCALL on all online cpus one by one but with a skip
> > > > > function to check whether to skip certain cpus, and use that helper to
> > > > > do the per-cpu initialization.
> > > > ...
> > > > > +/*
> > > > > + * Call @func on all online cpus one by one but skip those cpus
> > > > > + * when @skip_func is valid and returns true for them.
> > > > > + */
> > > > > +static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
> > > > > +                           bool (*skip_func)(int cpu, void *),
> > > > > +                           void *skip_data)
> > > > I only see one caller of this.  Where is the duplicated code?
> > > The other caller is in patch 15 (x86/virt/tdx: Configure global KeyID on all packages).
> > > 
> > > I kinda mentioned this in the changelog:
> > > 
> > >         " Similar to the per-cpu module initialization, a later step to config the key for the global KeyID..."
> > > 
> > > If we don't have this helper, then we can end up with having below loop in two functions:
> > > 
> > >         for_each_online(cpu) {
> > >                 if (should_skip(cpu))
> > >                         continue;
> > > 
> > >                 // call @func on @cpu.
> > >         }
> > 
> > I don't think saving two lines of actual code is worth the opacity that
> > results from this abstraction.
> 
> Alright thanks for the suggestion.  I'll remove this tdx_on_each_cpu_cond() and
> do directly.
> 
> But just checking:
> 
> LP.INIT can actually be called in parallel on different cpus (doesn't have to,
> of course), so we can actually just use on_each_cpu_cond() for LP.INIT:
> 
> 	on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true);
> 
> But IIUC Peter doesn't like using IPI and prefers using via work:
> 
> https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/
> 
> So I used smp_call_on_cpu() here, which only calls @func on one cpu, but not a
> cpumask.  For LP.INIT ideally we can have something like:
> 
> 	schedule_on_cpu(struct cpumask *cpus, work_func_t func);
> 
> to call @func on a cpu set, but that doesn't exist now, and I don't think it's
> worth to introduce it?

schedule_on_each_cpu() exists and can easily be extended to take a cond
function if you so please.
Peter Zijlstra Feb. 14, 2023, 2:12 p.m. UTC | #13
On Mon, Feb 13, 2023 at 10:07:30AM -0800, Dave Hansen wrote:
> On 2/13/23 03:59, Kai Huang wrote:
> > @@ -247,8 +395,17 @@ int tdx_enable(void)
> >  		ret = __tdx_enable();
> >  		break;
> >  	case TDX_MODULE_INITIALIZED:
> > -		/* Already initialized, great, tell the caller. */
> > -		ret = 0;
> > +		/*
> > +		 * The previous call of __tdx_enable() may only have
> > +		 * initialized part of present cpus during module
> > +		 * initialization, and new cpus may have become online
> > +		 * since then.
> > +		 *
> > +		 * To make sure all online cpus are TDX-runnable, always
> > +		 * do per-cpu initialization for all online cpus here
> > +		 * even the module has been initialized.
> > +		 */
> > +		ret = __tdx_enable_online_cpus();
> 
> I'm missing something here.  CPUs get initialized through either:
> 
>  1. __tdx_enable(), for the CPUs around at the time
>  2. tdx_cpu_online(), for hotplugged CPUs after __tdx_enable()
> 
> But, this is a third class.  CPUs that came online after #1, but which
> got missed by #2.  How can that happen?

offline CPUs, start TDX crap, online CPUs.
Huang, Kai Feb. 14, 2023, 10:53 p.m. UTC | #14
> > 
> > But just checking:
> > 
> > LP.INIT can actually be called in parallel on different cpus (doesn't have to,
> > of course), so we can actually just use on_each_cpu_cond() for LP.INIT:
> > 
> > 	on_each_cpu_cond(should_skip_cpu, smp_func_module_lp_init, NULL, true);
> > 
> > But IIUC Peter doesn't like using IPI and prefers using via work:
> > 
> > https://lore.kernel.org/lkml/Y30dujuXC8wlLwoQ@hirez.programming.kicks-ass.net/
> > 
> > So I used smp_call_on_cpu() here, which only calls @func on one cpu, but not a
> > cpumask.  For LP.INIT ideally we can have something like:
> > 
> > 	schedule_on_cpu(struct cpumask *cpus, work_func_t func);
> > 
> > to call @func on a cpu set, but that doesn't exist now, and I don't think it's
> > worth to introduce it?
> 
> schedule_on_each_cpu() exists and can easily be extended to take a cond
> function if you so please.
> 

Sure.  I just tried to do that.  There are two minor things:

1) Should I just use smp_cond_func_t directly as the cond function?
2) schedule_on_each_cpu() takes cpus_read_lock() internally.  However, in my
case tdx_enable() already takes that, so I need a _locked_ version.

How does the below look? (Not tested)

+/**
+ * schedule_on_each_cpu_cond_locked - execute a function synchronously
+ *                                   on each online CPU for which the
+ *                                   condition function returns positive
+ * @func:      the function to call
+ * @cond_func:  the condition function to call
+ * @cond_data: the data passed to the condition function
+ *
+ * schedule_on_each_cpu_cond_locked() executes @func on each online CPU
+ * when @cond_func returns positive for that cpu, using the system
+ * workqueue and blocks until all CPUs have completed.
+ *
+ * schedule_on_each_cpu_cond_locked() doesn't hold read lock of CPU
+ * hotplug lock but depend on the caller to do.
+ *
+ * schedule_on_each_cpu_cond_locked() is very slow.
+ *
+ * Return:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu_cond_locked(work_func_t func,
+                                    smp_cond_func_t cond_func,
+                                    void *cond_data)
+{
+       int cpu;
+       struct work_struct __percpu *works;
+
+       works = alloc_percpu(struct work_struct);
+       if (!works)
+               return -ENOMEM;
+
+       for_each_online_cpu(cpu) {
+               struct work_struct *work = per_cpu_ptr(works, cpu);
+
+               if (cond_func && !cond_func(cpu, cond_data))
+                       continue;
+
+               INIT_WORK(work, func);
+               schedule_work_on(cpu, work);
+       }
+
+       for_each_online_cpu(cpu)
+               flush_work(per_cpu_ptr(works, cpu));
+
+       free_percpu(works);
+       return 0;
+}
+
+/**
+ * schedule_on_each_cpu_cond - execute a function synchronously on each
+ *                            online CPU for which the condition
+ *                            function returns positive
+ * @func:      the function to call
+ * @cond_func:  the condition function to call
+ * @cond_data: the data passed to the condition function
+ *
+ * schedule_on_each_cpu_cond() executes @func on each online CPU
+ * when @cond_func returns positive for that cpu, using the system
+ * workqueue and blocks until all CPUs have completed.
+ *
+ * schedule_on_each_cpu_cond() is very slow.
+ *
+ * Return:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu_cond(work_func_t func,
+                             smp_cond_func_t cond_func,
+                             void *cond_data)
+{
+       int ret;
+
+       cpus_read_lock();
+
+       ret = schedule_on_each_cpu_cond_locked(func, cond_func, cond_data);
+
+       cpus_read_unlock();
+
+       return ret;
+}
Peter Zijlstra Feb. 15, 2023, 9:16 a.m. UTC | #15
On Tue, Feb 14, 2023 at 10:53:26PM +0000, Huang, Kai wrote:

> Sure.  I just tried to do.  There are two minor things:
> 
> 1) should I just use smp_cond_func_t directly as the cond function?

Yeah, might as well I suppose...

> 2) schedule_on_each_cpu() takes cpus_read_lock() internally.  However in my
> case, tdx_enable() already takes that so I need a _locked_ version.
> 
> How does below look like? (Not tested)
> 
> +/**
> + * schedule_on_each_cpu_cond_locked - execute a function synchronously
> + *                                   on each online CPU for which the
> + *                                   condition function returns positive
> + * @func:      the function to call
> + * @cond_func:  the condition function to call
> + * @cond_data: the data passed to the condition function
> + *
> + * schedule_on_each_cpu_cond_locked() executes @func on each online CPU
> + * when @cond_func returns positive for that cpu, using the system
> + * workqueue and blocks until all CPUs have completed.
> + *
> + * schedule_on_each_cpu_cond_locked() doesn't hold read lock of CPU
> + * hotplug lock but depend on the caller to do.
> + *
> + * schedule_on_each_cpu_cond_locked() is very slow.
> + *
> + * Return:
> + * 0 on success, -errno on failure.
> + */
> +int schedule_on_each_cpu_cond_locked(work_func_t func,
> +                                    smp_cond_func_t cond_func,
> +                                    void *cond_data)
> +{
> +       int cpu;
> +       struct work_struct __percpu *works;
> +
> +       works = alloc_percpu(struct work_struct);
> +       if (!works)
> +               return -ENOMEM;
> +
> +       for_each_online_cpu(cpu) {
> +               struct work_struct *work = per_cpu_ptr(works, cpu);
> +
> +               if (cond_func && !cond_func(cpu, cond_data))
> +                       continue;
> +
> +               INIT_WORK(work, func);
> +               schedule_work_on(cpu, work);
> +       }
> +
> +       for_each_online_cpu(cpu)

I think you need to skip some flushes too. Given we skip setting
work->func, this will go WARN, see __flush_work().

> +               flush_work(per_cpu_ptr(works, cpu));
> +
> +       free_percpu(works);
> +       return 0;
> +}
> +
> +/**
> + * schedule_on_each_cpu_cond - execute a function synchronously on each
> + *                            online CPU for which the condition
> + *                            function returns positive
> + * @func:      the function to call
> + * @cond_func:  the condition function to call
> + * @cond_data: the data passed to the condition function
> + *
> + * schedule_on_each_cpu_cond() executes @func on each online CPU
> + * when @cond_func returns positive for that cpu, using the system
> + * workqueue and blocks until all CPUs have completed.
> + *
> + * schedule_on_each_cpu_cond() is very slow.
> + *
> + * Return:
> + * 0 on success, -errno on failure.
> + */
> +int schedule_on_each_cpu_cond(work_func_t func,
> +                             smp_cond_func_t cond_func,
> +                             void *cond_data)
> +{
> +       int ret;
> +
> +       cpus_read_lock();
> +
> +       ret = schedule_on_each_cpu_cond_locked(func, cond_func, cond_data);
> +
> +       cpus_read_unlock();
> +
> +       return ret;
> +}

Also, re-implement schedule_on_each_cpu() using the above to save a
bunch of duplication:

int schedule_on_each_cpu(work_func_t func)
{
	return schedule_on_each_cpu_cond(func, NULL, NULL);
}


That said, I find it jarring that the schedule_on*() family doesn't have
a void* argument to the function, like the smp_*() family has. So how
about something like the below (equally untested). It preserves the
current semantics, but allows a work function to cast to schedule_work
and access ->info if it so desires.


diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a0143dd24430..5e97111322b2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -103,6 +103,11 @@ struct work_struct {
 #endif
 };
 
+struct schedule_work {
+	struct work_struct work;
+	void *info;
+};
+
 #define WORK_DATA_INIT()	ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL)
 #define WORK_DATA_STATIC_INIT()	\
 	ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC))
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07895deca271..c73bb8860bbc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/nmi.h>
 #include <linux/kvm_para.h>
+#include <linux/smp.h>
 
 #include "workqueue_internal.h"
 
@@ -3302,43 +3303,64 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
 }
 EXPORT_SYMBOL(cancel_delayed_work_sync);
 
-/**
- * schedule_on_each_cpu - execute a function synchronously on each online CPU
- * @func: the function to call
- *
- * schedule_on_each_cpu() executes @func on each online CPU using the
- * system workqueue and blocks until all CPUs have completed.
- * schedule_on_each_cpu() is very slow.
- *
- * Return:
- * 0 on success, -errno on failure.
- */
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_each_cpu_cond_locked(work_func_t func, smp_cond_func_t cond_func, void *info)
 {
+	struct schedule_work __percpu *works;
 	int cpu;
-	struct work_struct __percpu *works;
 
-	works = alloc_percpu(struct work_struct);
+	works = alloc_percpu(struct schedule_work);
 	if (!works)
 		return -ENOMEM;
 
-	cpus_read_lock();
-
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct schedule_work *work = per_cpu_ptr(works, cpu);
 
-		INIT_WORK(work, func);
-		schedule_work_on(cpu, work);
+		if (cond_func && !cond_func(cpu, info))
+			continue;
+
+		INIT_WORK(&work->work, func);
+		work->info = info;
+		schedule_work_on(cpu, &work->work);
 	}
 
-	for_each_online_cpu(cpu)
-		flush_work(per_cpu_ptr(works, cpu));
+	for_each_online_cpu(cpu) {
+		struct schedule_work *work = per_cpu_ptr(works, cpu);
+
+		if (work->work.func)
+			flush_work(&work->work);
+	}
 
-	cpus_read_unlock();
 	free_percpu(works);
 	return 0;
 }
 
+int schedule_on_each_cpu_cond(work_func_t func, smp_cond_func_t cond_func, void *info)
+{
+	int ret;
+
+	cpus_read_lock();
+	ret = schedule_on_each_cpu_cond_locked(func, cond, info);
+	cpus_read_unlock();
+
+	return ret;
+}
+
+/**
+ * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * @func: the function to call
+ *
+ * schedule_on_each_cpu() executes @func on each online CPU using the
+ * system workqueue and blocks until all CPUs have completed.
+ * schedule_on_each_cpu() is very slow.
+ *
+ * Return:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu(work_func_t func)
+{
+	return schedule_on_each_cpu_cond(func, NULL, NULL);
+}
+
 /**
  * execute_in_process_context - reliably execute the routine with user context
  * @fn:		the function to execute
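
For example, a work function handed to this variant could recover the @info
pointer roughly like below (lp_init_work_fn is a made-up name, illustrative
only):

	static void lp_init_work_fn(struct work_struct *work)
	{
		struct schedule_work *sw = container_of(work, struct schedule_work, work);
		void *info = sw->info;	/* @info passed to schedule_on_each_cpu_cond() */

		/* ... do the per-cpu work using @info ... */
	}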
Huang, Kai Feb. 15, 2023, 9:46 a.m. UTC | #16
> > +
> > +       for_each_online_cpu(cpu)
> 
> I think you need to skip some flushes too. Given we skip setting
> work->func, this will go WARN, see __flush_work().

I missed that.  Thanks for pointing it out.

> 
> > +               flush_work(per_cpu_ptr(works, cpu));
> > +
> 

[...]

> That said, I find it jarring that the schedule_on*() family doesn't have
> a void* argument to the function, like the smp_*() family has. So how
> about something like the below (equally untested). It preserves the
> current semantics, but allows a work function to cast to schedule_work
> and access ->info if it so desires.

Yes, agreed.  Your code below indeed looks better.  Thanks!

Would you mind sending me a patch so I can include it in this series, or
getting it merged to tip/x86/tdx (or another branch, I am not sure which) so I
can rebase?

Appreciate your help!

> 
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index a0143dd24430..5e97111322b2 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -103,6 +103,11 @@ struct work_struct {
>  #endif
>  };
>  
> +struct schedule_work {
> +	struct work_struct work;
> +	void *info;
> +};
> +
>  #define WORK_DATA_INIT()	ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL)
>  #define WORK_DATA_STATIC_INIT()	\
>  	ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC))
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 07895deca271..c73bb8860bbc 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -51,6 +51,7 @@
>  #include <linux/sched/isolation.h>
>  #include <linux/nmi.h>
>  #include <linux/kvm_para.h>
> +#include <linux/smp.h>
>  
>  #include "workqueue_internal.h"
>  
> @@ -3302,43 +3303,64 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
>  }
>  EXPORT_SYMBOL(cancel_delayed_work_sync);
>  
> -/**
> - * schedule_on_each_cpu - execute a function synchronously on each online CPU
> - * @func: the function to call
> - *
> - * schedule_on_each_cpu() executes @func on each online CPU using the
> - * system workqueue and blocks until all CPUs have completed.
> - * schedule_on_each_cpu() is very slow.
> - *
> - * Return:
> - * 0 on success, -errno on failure.
> - */
> -int schedule_on_each_cpu(work_func_t func)
> +int schedule_on_each_cpu_cond_locked(work_func_t func, smp_cond_func_t cond_func, void *info)
>  {
> +	struct schedule_work __percpu *works;
>  	int cpu;
> -	struct work_struct __percpu *works;
>  
> -	works = alloc_percpu(struct work_struct);
> +	works = alloc_percpu(struct schedule_work);
>  	if (!works)
>  		return -ENOMEM;
>  
> -	cpus_read_lock();
> -
>  	for_each_online_cpu(cpu) {
> -		struct work_struct *work = per_cpu_ptr(works, cpu);
> +		struct schedule_work *work = per_cpu_ptr(works, cpu);
>  
> -		INIT_WORK(work, func);
> -		schedule_work_on(cpu, work);
> +		if (cond_func && !cond_func(cpu, info))
> +			continue;
> +
> +		INIT_WORK(&work->work, func);
> +		work->info = info;
> +		schedule_work_on(cpu, &work->work);
>  	}
>  
> -	for_each_online_cpu(cpu)
> -		flush_work(per_cpu_ptr(works, cpu));
> +	for_each_online_cpu(cpu) {
> +		struct schedule_work *work = per_cpu_ptr(works, cpu);
> +
> +		if (work->work.func)
> +			flush_work(&work->work);
> +	}
>  
> -	cpus_read_unlock();
>  	free_percpu(works);
>  	return 0;
>  }
>  
> +int schedule_on_each_cpu_cond(work_func_t func, smp_cond_func_t cond_func, void *info)
> +{
> +	int ret;
> +
> +	cpus_read_lock();
> +	ret = schedule_on_each_cpu_cond_locked(func, cond, info);
> +	cpus_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * schedule_on_each_cpu - execute a function synchronously on each online CPU
> + * @func: the function to call
> + *
> + * schedule_on_each_cpu() executes @func on each online CPU using the
> + * system workqueue and blocks until all CPUs have completed.
> + * schedule_on_each_cpu() is very slow.
> + *
> + * Return:
> + * 0 on success, -errno on failure.
> + */
> +int schedule_on_each_cpu(work_func_t func)
> +{
> +	return schedule_on_each_cpu_cond(func, NULL, NULL);
> +}
> +
>  /**
>   * execute_in_process_context - reliably execute the routine with user context
>   * @fn:		the function to execute
Peter Zijlstra Feb. 15, 2023, 1:25 p.m. UTC | #17
On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote:
> Yes agreed.  Your code below looks indeed better.  Thanks!
> 
> Would you mind send me a patch so I can include to this series, or would you
> mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can
> rebase?

Just take the patch, add your comments and test it.. enjoy! :-)
Huang, Kai Feb. 15, 2023, 9:37 p.m. UTC | #18
On Wed, 2023-02-15 at 14:25 +0100, Peter Zijlstra wrote:
> On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote:
> > Yes agreed.  Your code below looks indeed better.  Thanks!
> > 
> > Would you mind send me a patch so I can include to this series, or would you
> > mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can
> > rebase?
> 
> Just take the patch, add your comments and test it.. enjoy! :-)

Thank you! I'll at least add your Suggested-by :)
Huang, Kai March 6, 2023, 2:26 p.m. UTC | #19
On Wed, 2023-02-15 at 21:37 +0000, Huang, Kai wrote:
> On Wed, 2023-02-15 at 14:25 +0100, Peter Zijlstra wrote:
> > On Wed, Feb 15, 2023 at 09:46:10AM +0000, Huang, Kai wrote:
> > > Yes agreed.  Your code below looks indeed better.  Thanks!
> > > 
> > > Would you mind send me a patch so I can include to this series, or would you
> > > mind get it merged to tip/x86/tdx (or other branch I am not sure) so I can
> > > rebase?
> > 
> > Just take the patch, add your comments and test it.. enjoy! :-)
> 
> Thank you! I'll at least add your Suggested-by :)

Hi Peter,

After discussing with Kirill, I changed how the per-cpu initialization is
handled, and in the new version (v10, just sent) I don't need
schedule_on_each_cpu_cond_locked() anymore, because I essentially moved such
handling out of the TDX host series to KVM.  I'll use your patch if review
finds we still need to handle it.  Thanks!

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5c5ecfddb15b..2b2efaa4bc0e 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -107,9 +107,11 @@  static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
+int tdx_cpu_online(unsigned int cpu);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void)  { return -EINVAL; }
+static inline int tdx_cpu_online(unsigned int cpu) { return 0; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 79cee28c51b5..23b2db28726f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,8 @@ 
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/mutex.h>
+#include <linux/cpumask.h>
+#include <linux/cpu.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
@@ -26,6 +28,10 @@  static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX module initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* TDX-runnable cpus.  Protected by cpu_hotplug_lock. */
+static cpumask_t __cpu_tdx_mask;
+static cpumask_t *cpu_tdx_mask = &__cpu_tdx_mask;
+
 /*
  * Use tdx_global_keyid to indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -170,6 +176,63 @@  static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	return ret;
 }
 
+/*
+ * Call @func on all online cpus one by one but skip those cpus
+ * when @skip_func is valid and returns true for them.
+ */
+static int tdx_on_each_cpu_cond(int (*func)(void *), void *func_data,
+				bool (*skip_func)(int cpu, void *),
+				void *skip_data)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		int ret;
+
+		if (skip_func && skip_func(cpu, skip_data))
+			continue;
+
+		/*
+		 * SEAMCALL can be time consuming.  Call the @func on
+		 * remote cpu via smp_call_on_cpu() instead of
+		 * smp_call_function_single() to avoid busy waiting.
+		 */
+		ret = smp_call_on_cpu(cpu, func, func_data, true);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int seamcall_lp_init(void)
+{
+	/* All '0's are just unused parameters */
+	return seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+}
+
+static int smp_func_module_lp_init(void *data)
+{
+	int ret, cpu = smp_processor_id();
+
+	ret = seamcall_lp_init();
+	if (!ret)
+		cpumask_set_cpu(cpu, cpu_tdx_mask);
+
+	return ret;
+}
+
+static bool skip_func_module_lp_init_done(int cpu, void *data)
+{
+	return cpumask_test_cpu(cpu, cpu_tdx_mask);
+}
+
+static int module_lp_init_online_cpus(void)
+{
+	return tdx_on_each_cpu_cond(smp_func_module_lp_init, NULL,
+			skip_func_module_lp_init_done, NULL);
+}
+
 static int init_tdx_module(void)
 {
 	int ret;
@@ -182,10 +245,26 @@  static int init_tdx_module(void)
 	if (ret)
 		return ret;
 
+	/*
+	 * TDX module per-cpu initialization SEAMCALL must be done on
+	 * one cpu before any other SEAMCALLs can be made on that cpu,
+	 * including those involved in further steps to initialize the
+	 * TDX module.
+	 *
+	 * To make sure further SEAMCALLs can be done successfully w/o
+	 * having to consider preemption, disable CPU hotplug during
+	 * rest of module initialization and do per-cpu initialization
+	 * for all online cpus.
+	 */
+	cpus_read_lock();
+
+	ret = module_lp_init_online_cpus();
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - TDX module per-cpu initialization.
 	 *  - Get TDX module information and TDX-capable memory regions.
 	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
@@ -196,7 +275,17 @@  static int init_tdx_module(void)
 	 *
 	 *  Return error before all steps are done.
 	 */
-	return -EINVAL;
+	ret = -EINVAL;
+out:
+	/*
+	 * Clear @cpu_tdx_mask if module initialization fails before
+	 * CPU hotplug is re-enabled.  tdx_cpu_online() uses it to check
+	 * whether the initialization has been successful or not.
+	 */
+	if (ret)
+		cpumask_clear(cpu_tdx_mask);
+	cpus_read_unlock();
+	return ret;
 }
 
 static int __tdx_enable(void)
@@ -220,13 +309,72 @@  static int __tdx_enable(void)
 	return 0;
 }
 
+/*
+ * Disable TDX module after it has been initialized successfully.
+ */
+static void disable_tdx_module(void)
+{
+	/*
+	 * TODO: module clean up in reverse to steps in
+	 * init_tdx_module().  Remove this comment after
+	 * all steps are done.
+	 */
+	cpumask_clear(cpu_tdx_mask);
+}
+
+static int tdx_module_init_online_cpus(void)
+{
+	int ret;
+
+	/*
+	 * Make sure no cpu can become online to prevent
+	 * race against tdx_cpu_online().
+	 */
+	cpus_read_lock();
+
+	/*
+	 * Do per-cpu initialization for any new online cpus.
+	 * If any fails, disable TDX.
+	 */
+	ret = module_lp_init_online_cpus();
+	if (ret)
+		disable_tdx_module();
+
+	cpus_read_unlock();
+
+	return ret;
+}
+
+static int __tdx_enable_online_cpus(void)
+{
+	if (tdx_module_init_online_cpus()) {
+		/*
+		 * SEAMCALL failure has already printed
+		 * meaningful error message.
+		 */
+		tdx_module_status = TDX_MODULE_ERROR;
+
+		/*
+		 * Just return one universal error code.
+		 * For now the caller cannot recover anyway.
+		 */
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * tdx_enable - Enable TDX to be ready to run TDX guests
  *
  * Initialize the TDX module to enable TDX.  After this function, the TDX
- * module is ready to create and run TDX guests.
+ * module is ready to create and run TDX guests on all online cpus.
+ *
+ * This function internally calls cpus_read_lock()/unlock() to prevent
+ * any cpu from going online and offline.
  *
  * This function assumes all online cpus are already in VMX operation.
+ *
  * This function can be called in parallel by multiple callers.
  *
  * Return 0 if TDX is enabled successfully, otherwise error.
@@ -247,8 +395,17 @@  int tdx_enable(void)
 		ret = __tdx_enable();
 		break;
 	case TDX_MODULE_INITIALIZED:
-		/* Already initialized, great, tell the caller. */
-		ret = 0;
+		/*
+		 * The previous call of __tdx_enable() may only have
+		 * initialized part of present cpus during module
+		 * initialization, and new cpus may have become online
+		 * since then.
+		 *
+		 * To make sure all online cpus are TDX-runnable, always
+		 * do per-cpu initialization for all online cpus here
+		 * even the module has been initialized.
+		 */
+		ret = __tdx_enable_online_cpus();
 		break;
 	default:
 		/* Failed to initialize in the previous attempts */
@@ -261,3 +418,46 @@  int tdx_enable(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdx_enable);
+
+/**
+ * tdx_cpu_online - Enable TDX on a hotplugged local cpu
+ *
+ * @cpu: the cpu to be brought up.
+ *
+ * Do TDX module per-cpu initialization for a hotplugged cpu to make
+ * it TDX-runnable.  All online cpus are initialized during module
+ * initialization.
+ *
+ * This function must be called from CPU hotplug callback which holds
+ * write lock of cpu_hotplug_lock.
+ *
+ * This function assumes local cpu is already in VMX operation.
+ */
+int tdx_cpu_online(unsigned int cpu)
+{
+	int ret;
+
+	/*
+	 * @cpu_tdx_mask is updated in tdx_enable() and is protected
+	 * by cpus_read_lock()/unlock().  If it is empty, TDX module
+	 * either hasn't been initialized, or TDX didn't get enabled
+	 * successfully.
+	 *
+	 * In either case, do nothing but return success.
+	 */
+	if (cpumask_empty(cpu_tdx_mask))
+		return 0;
+
+	WARN_ON_ONCE(cpu != smp_processor_id());
+
+	/* Already done */
+	if (cpumask_test_cpu(cpu, cpu_tdx_mask))
+		return 0;
+
+	ret = seamcall_lp_init();
+	if (!ret)
+		cpumask_set_cpu(cpu, cpu_tdx_mask);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_online);
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 55472e085bc8..30413d7fbee8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -15,6 +15,7 @@ 
   * TDX module SEAMCALL leaf functions
   */
 #define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
 
 /* Kernel defined TDX module status during module initialization. */
 enum tdx_module_status_t {