diff mbox series

[v8,20/40] x86/sev: Use SEV-SNP AP creation to start secondary CPUs

Message ID 20211210154332.11526-21-brijesh.singh@amd.com (mailing list archive)
State New
Headers show
Series Add AMD Secure Nested Paging (SEV-SNP) Guest Support | expand

Commit Message

Brijesh Singh Dec. 10, 2021, 3:43 p.m. UTC
From: Tom Lendacky <thomas.lendacky@amd.com>

To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP
AP Creation NAE event. This allows for guest control over the AP register
state rather than trusting the hypervisor with the SEV-ES Jump Table
address.

During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if
SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This
will allow the SEV-SNP AP Creation NAE event method to be used to boot
the APs. As a result of installing the override when SEV-SNP is active,
this method of starting the APs becomes the required method. The override
function will fail to start the AP if the hypervisor does not have
support for AP creation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-common.h |   1 +
 arch/x86/include/asm/sev.h        |   4 +
 arch/x86/include/uapi/asm/svm.h   |   5 +
 arch/x86/kernel/sev.c             | 229 ++++++++++++++++++++++++++++++
 arch/x86/kernel/smpboot.c         |   3 +
 5 files changed, 242 insertions(+)

Comments

Dave Hansen Dec. 10, 2021, 6:50 p.m. UTC | #1
On 12/10/21 7:43 AM, Brijesh Singh wrote:
> +	vmsa->efer		= 0x1000;	/* Must set SVME bit */
> +	vmsa->cr4		= cr4;
> +	vmsa->cr0		= 0x60000010;
> +	vmsa->dr7		= 0x400;
> +	vmsa->dr6		= 0xffff0ff0;
> +	vmsa->rflags		= 0x2;
> +	vmsa->g_pat		= 0x0007040600070406ULL;
> +	vmsa->xcr0		= 0x1;
> +	vmsa->mxcsr		= 0x1f80;
> +	vmsa->x87_ftw		= 0x5555;
> +	vmsa->x87_fcw		= 0x0040;

This is a big fat pile of magic numbers.  We also have nice macros for a
non-zero number of these, like:

	#define MXCSR_DEFAULT 0x1f80

I understand that this probably _works_ as-is, but it doesn't look very
friendly if someone else needs to go hack on it.
Borislav Petkov Dec. 31, 2021, 3:36 p.m. UTC | #2
On Fri, Dec 10, 2021 at 09:43:12AM -0600, Brijesh Singh wrote:
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 123a96f7dff2..38c14601ae4a 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -104,6 +104,7 @@ enum psc_op {
>  	(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
>  
>  #define GHCB_HV_FT_SNP			BIT_ULL(0)
> +#define GHCB_HV_FT_SNP_AP_CREATION	(BIT_ULL(1) | GHCB_HV_FT_SNP)

Why is bit 0 ORed in? Because it "Requires SEV-SNP Feature."?

You can still enforce that requirement in the test though.

Or all those SEV features should not be bits but masks -
GHCB_HV_FT_SNP_AP_CREATION_MASK for example, seeing how the others
require the previous bits to be set too.

...

>  static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
>  DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
>  
> +static DEFINE_PER_CPU(struct sev_es_save_area *, snp_vmsa);

This is what I mean: the struct is called "sev_es... " but the variable
"snp_...". I.e., it is all sev_<something>.

> +
>  static __always_inline bool on_vc_stack(struct pt_regs *regs)
>  {
>  	unsigned long sp = regs->sp;
> @@ -814,6 +818,231 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
>  	pvalidate_pages(vaddr, npages, 1);
>  }
>  
> +static int snp_set_vmsa(void *va, bool vmsa)
> +{
> +	u64 attrs;
> +
> +	/*
> +	 * The RMPADJUST instruction is used to set or clear the VMSA bit for
> +	 * a page. A change to the VMSA bit is only performed when running
> +	 * at VMPL0 and is ignored at other VMPL levels. If too low of a target

What does "too low" mean here exactly?

The kernel is not at VMPL0 but the specified level is lower? Weird...

> +	 * VMPL level is specified, the instruction can succeed without changing
> +	 * the VMSA bit should the kernel not be in VMPL0. Using a target VMPL
> +	 * level of 1 will return a FAIL_PERMISSION error if the kernel is not
> +	 * at VMPL0, thus ensuring that the VMSA bit has been properly set when
> +	 * no error is returned.

We do check whether we run at VMPL0 earlier when starting the guest -
see enforce_vmpl0().

I don't think you need any of that additional verification here - just
assume you are at VMPL0.

> +	 */
> +	attrs = 1;
> +	if (vmsa)
> +		attrs |= RMPADJUST_VMSA_PAGE_BIT;
> +
> +	return rmpadjust((unsigned long)va, RMP_PG_SIZE_4K, attrs);
> +}
> +
> +#define __ATTR_BASE		(SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK)
> +#define INIT_CS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_READ_MASK | SVM_SELECTOR_CODE_MASK)
> +#define INIT_DS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_WRITE_MASK)
> +
> +#define INIT_LDTR_ATTRIBS	(SVM_SELECTOR_P_MASK | 2)
> +#define INIT_TR_ATTRIBS		(SVM_SELECTOR_P_MASK | 3)
> +
> +static void *snp_safe_alloc_page(void)

safe?

And you don't need to say "safe" - snp_alloc_vmsa_page() is perfectly fine.

> +{
> +	unsigned long pfn;
> +	struct page *p;
> +
> +	/*
> +	 * Allocate an SNP safe page to workaround the SNP erratum where
> +	 * the CPU will incorrectly signal an RMP violation  #PF if a
> +	 * hugepage (2mb or 1gb) collides with the RMP entry of VMSA page.

		2MB or 1GB

Collides how? The 4K frame is inside the hugepage?

> +	 * The recommeded workaround is to not use the large page.

Unknown word [recommeded] in comment, suggestions:
        ['recommended', 'recommend', 'recommitted', 'commended', 'commandeered']

> +	 *
> +	 * Allocate one extra page, use a page which is not 2mb aligned

2MB-aligned

> +	 * and free the other.
> +	 */
> +	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> +	if (!p)
> +		return NULL;
> +
> +	split_page(p, 1);
> +
> +	pfn = page_to_pfn(p);
> +	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
> +		pfn++;
> +		__free_page(p);
> +	} else {
> +		__free_page(pfn_to_page(pfn + 1));
> +	}

AFAICT, this is doing all this stuff so that you can make sure you get a
non-2M-aligned page. I wonder if there's a way to simply ask mm to give
you such page directly.

vbabka?

> +
> +	return page_address(pfn_to_page(pfn));
> +}
> +
> +static int wakeup_cpu_via_vmgexit(int apic_id, unsigned long start_ip)
> +{
> +	struct sev_es_save_area *cur_vmsa, *vmsa;
> +	struct ghcb_state state;
> +	unsigned long flags;
> +	struct ghcb *ghcb;
> +	int cpu, err, ret;
> +	u8 sipi_vector;
> +	u64 cr4;
> +
> +	if ((sev_hv_features & GHCB_HV_FT_SNP_AP_CREATION) != GHCB_HV_FT_SNP_AP_CREATION)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Verify the desired start IP against the known trampoline start IP
> +	 * to catch any future new trampolines that may be introduced that
> +	 * would require a new protected guest entry point.
> +	 */
> +	if (WARN_ONCE(start_ip != real_mode_header->trampoline_start,
> +		      "Unsupported SEV-SNP start_ip: %lx\n", start_ip))
> +		return -EINVAL;
> +
> +	/* Override start_ip with known protected guest start IP */
> +	start_ip = real_mode_header->sev_es_trampoline_start;
> +
> +	/* Find the logical CPU for the APIC ID */
> +	for_each_present_cpu(cpu) {
> +		if (arch_match_cpu_phys_id(cpu, apic_id))
> +			break;
> +	}
> +	if (cpu >= nr_cpu_ids)
> +		return -EINVAL;
> +
> +	cur_vmsa = per_cpu(snp_vmsa, cpu);
> +
> +	/*
> +	 * A new VMSA is created each time because there is no guarantee that
> +	 * the current VMSA is the kernels or that the vCPU is not running. If

kernel's.

And if it is not the kernel's, whose it is?

> +	 * an attempt was done to use the current VMSA with a running vCPU, a
> +	 * #VMEXIT of that vCPU would wipe out all of the settings being done
> +	 * here.

I don't understand - this is waking up a CPU, how can it ever be a
running vCPU which is using the current VMSA?!

There is per_cpu(snp_vmsa, cpu), who else can be using that one currently?

> +	 */
> +	vmsa = (struct sev_es_save_area *)snp_safe_alloc_page();
> +	if (!vmsa)
> +		return -ENOMEM;
> +
> +	/* CR4 should maintain the MCE value */
> +	cr4 = native_read_cr4() & X86_CR4_MCE;
> +
> +	/* Set the CS value based on the start_ip converted to a SIPI vector */
> +	sipi_vector		= (start_ip >> 12);
> +	vmsa->cs.base		= sipi_vector << 12;
> +	vmsa->cs.limit		= 0xffff;
> +	vmsa->cs.attrib		= INIT_CS_ATTRIBS;
> +	vmsa->cs.selector	= sipi_vector << 8;
> +
> +	/* Set the RIP value based on start_ip */
> +	vmsa->rip		= start_ip & 0xfff;
> +
> +	/* Set VMSA entries to the INIT values as documented in the APM */
> +	vmsa->ds.limit		= 0xffff;
> +	vmsa->ds.attrib		= INIT_DS_ATTRIBS;
> +	vmsa->es		= vmsa->ds;
> +	vmsa->fs		= vmsa->ds;
> +	vmsa->gs		= vmsa->ds;
> +	vmsa->ss		= vmsa->ds;
> +
> +	vmsa->gdtr.limit	= 0xffff;
> +	vmsa->ldtr.limit	= 0xffff;
> +	vmsa->ldtr.attrib	= INIT_LDTR_ATTRIBS;
> +	vmsa->idtr.limit	= 0xffff;
> +	vmsa->tr.limit		= 0xffff;
> +	vmsa->tr.attrib		= INIT_TR_ATTRIBS;
> +
> +	vmsa->efer		= 0x1000;	/* Must set SVME bit */

verify_comment_style: Warning: No tail comments please:
 arch/x86/kernel/sev.c:954 [+   vmsa->efer              = 0x1000;       /* Must set SVME bit */]

> +	vmsa->cr4		= cr4;
> +	vmsa->cr0		= 0x60000010;
> +	vmsa->dr7		= 0x400;
> +	vmsa->dr6		= 0xffff0ff0;
> +	vmsa->rflags		= 0x2;
> +	vmsa->g_pat		= 0x0007040600070406ULL;
> +	vmsa->xcr0		= 0x1;
> +	vmsa->mxcsr		= 0x1f80;
> +	vmsa->x87_ftw		= 0x5555;
> +	vmsa->x87_fcw		= 0x0040;

Yah, those definitely need macros or at least comments ontop denoting
what those naked values are.

> +
> +	/*
> +	 * Set the SNP-specific fields for this VMSA:
> +	 *   VMPL level
> +	 *   SEV_FEATURES (matches the SEV STATUS MSR right shifted 2 bits)
> +	 */

Like this^^

> +	vmsa->vmpl		= 0;
> +	vmsa->sev_features	= sev_status >> 2;
> +
> +	/* Switch the page over to a VMSA page now that it is initialized */
> +	ret = snp_set_vmsa(vmsa, true);
> +	if (ret) {
> +		pr_err("set VMSA page failed (%u)\n", ret);
> +		free_page((unsigned long)vmsa);
> +
> +		return -EINVAL;
> +	}
> +
> +	/* Issue VMGEXIT AP Creation NAE event */
> +	local_irq_save(flags);
> +
> +	ghcb = __sev_get_ghcb(&state);
> +
> +	vc_ghcb_invalidate(ghcb);
> +	ghcb_set_rax(ghcb, vmsa->sev_features);
> +	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_CREATION);
> +	ghcb_set_sw_exit_info_1(ghcb, ((u64)apic_id << 32) | SVM_VMGEXIT_AP_CREATE);
> +	ghcb_set_sw_exit_info_2(ghcb, __pa(vmsa));
> +
> +	sev_es_wr_ghcb_msr(__pa(ghcb));
> +	VMGEXIT();
> +
> +	if (!ghcb_sw_exit_info_1_is_valid(ghcb) ||
> +	    lower_32_bits(ghcb->save.sw_exit_info_1)) {
> +		pr_alert("SNP AP Creation error\n");

alert?

> +		ret = -EINVAL;
> +	}
> +
> +	__sev_put_ghcb(&state);
> +
> +	local_irq_restore(flags);
> +
> +	/* Perform cleanup if there was an error */
> +	if (ret) {
> +		err = snp_set_vmsa(vmsa, false);
> +		if (err)
> +			pr_err("clear VMSA page failed (%u), leaking page\n", err);
> +		else
> +			free_page((unsigned long)vmsa);

That...

> +
> +		vmsa = NULL;
> +	}
> +
> +	/* Free up any previous VMSA page */
> +	if (cur_vmsa) {
> +		err = snp_set_vmsa(cur_vmsa, false);
> +		if (err)
> +			pr_err("clear VMSA page failed (%u), leaking page\n", err);
> +		else
> +			free_page((unsigned long)cur_vmsa);

.. and that wants to be in a common helper.

> +	}
> +
> +	/* Record the current VMSA page */
> +	per_cpu(snp_vmsa, cpu) = vmsa;
> +
> +	return ret;
> +}
Vlastimil Babka Jan. 3, 2022, 6:10 p.m. UTC | #3
On 12/31/21 16:36, Borislav Petkov wrote:
> On Fri, Dec 10, 2021 at 09:43:12AM -0600, Brijesh Singh wrote:
>> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
>> index 123a96f7dff2..38c14601ae4a 100644
> 
>> +{
>> +	unsigned long pfn;
>> +	struct page *p;
>> +
>> +	/*
>> +	 * Allocate an SNP safe page to workaround the SNP erratum where
>> +	 * the CPU will incorrectly signal an RMP violation  #PF if a
>> +	 * hugepage (2mb or 1gb) collides with the RMP entry of VMSA page.
> 
> 		2MB or 1GB
> 
> Collides how? The 4K frame is inside the hugepage?
> 
>> +	 * The recommeded workaround is to not use the large page.
> 
> Unknown word [recommeded] in comment, suggestions:
>         ['recommended', 'recommend', 'recommitted', 'commended', 'commandeered']
> 
>> +	 *
>> +	 * Allocate one extra page, use a page which is not 2mb aligned
> 
> 2MB-aligned
> 
>> +	 * and free the other.
>> +	 */
>> +	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
>> +	if (!p)
>> +		return NULL;
>> +
>> +	split_page(p, 1);
>> +
>> +	pfn = page_to_pfn(p);
>> +	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
>> +		pfn++;
>> +		__free_page(p);
>> +	} else {
>> +		__free_page(pfn_to_page(pfn + 1));
>> +	}
> 
> AFAICT, this is doing all this stuff so that you can make sure you get a
> non-2M-aligned page. I wonder if there's a way to simply ask mm to give
> you such page directly.
> 
> vbabka?

AFAIK, not, as this is a very unusual constraint. Unless there are more
places that need it, should be fine to solve it like above. Maybe just also
be optimistic and try a plain order-0 first and only if it has the undesired
alignment (which should happen only once per 512 allocations), fallback to
the above?
Brijesh Singh Jan. 12, 2022, 4:17 p.m. UTC | #4
On 12/10/21 12:50 PM, Dave Hansen wrote:
> On 12/10/21 7:43 AM, Brijesh Singh wrote:
>> +	vmsa->efer		= 0x1000;	/* Must set SVME bit */
>> +	vmsa->cr4		= cr4;
>> +	vmsa->cr0		= 0x60000010;
>> +	vmsa->dr7		= 0x400;
>> +	vmsa->dr6		= 0xffff0ff0;
>> +	vmsa->rflags		= 0x2;
>> +	vmsa->g_pat		= 0x0007040600070406ULL;
>> +	vmsa->xcr0		= 0x1;
>> +	vmsa->mxcsr		= 0x1f80;
>> +	vmsa->x87_ftw		= 0x5555;
>> +	vmsa->x87_fcw		= 0x0040;
> 
> This is a big fat pile of magic numbers.  We also have nice macros for a
> non-zero number of these, like:
> 
> 	#define MXCSR_DEFAULT 0x1f80
> 
> I understand that this probably _works_ as-is, but it doesn't look very
> friendly if someone else needs to go hack on it.
> 

APM documents the default value for the AP following the RESET or INIT, 
I will define macros and use them accordingly.

thx
Brijesh Singh Jan. 12, 2022, 4:33 p.m. UTC | #5
On 12/31/21 9:36 AM, Borislav Petkov wrote:
> On Fri, Dec 10, 2021 at 09:43:12AM -0600, Brijesh Singh wrote:
>> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
>> index 123a96f7dff2..38c14601ae4a 100644
>> --- a/arch/x86/include/asm/sev-common.h
>> +++ b/arch/x86/include/asm/sev-common.h
>> @@ -104,6 +104,7 @@ enum psc_op {
>>   	(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
>>   
>>   #define GHCB_HV_FT_SNP			BIT_ULL(0)
>> +#define GHCB_HV_FT_SNP_AP_CREATION	(BIT_ULL(1) | GHCB_HV_FT_SNP)
> 
> Why is bit 0 ORed in? Because it "Requires SEV-SNP Feature."?
> 

Yes, the SEV-SNP feature is required. Anyway, I will improve a check. We 
will reach to AP creation only after SEV-SNP feature is checked, so, in 
AP creation routine we just need to check for the AP_CREATION specific 
feature flag; I will add comment about it.

> You can still enforce that requirement in the test though.
> 
> Or all those SEV features should not be bits but masks -
> GHCB_HV_FT_SNP_AP_CREATION_MASK for example, seeing how the others
> require the previous bits to be set too.
> 

> ...
> 
>>   static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
>>   DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
>>   
>> +static DEFINE_PER_CPU(struct sev_es_save_area *, snp_vmsa);
> 
> This is what I mean: the struct is called "sev_es... " but the variable
> "snp_...". I.e., it is all sev_<something>.
> 

Sure, I define the variable as sev_vmsa.

>> +
>>   static __always_inline bool on_vc_stack(struct pt_regs *regs)
>>   {
>>   	unsigned long sp = regs->sp;
>> @@ -814,6 +818,231 @@ void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
>>   	pvalidate_pages(vaddr, npages, 1);
>>   }
>>   
>> +static int snp_set_vmsa(void *va, bool vmsa)
>> +{
>> +	u64 attrs;
>> +
>> +	/*
>> +	 * The RMPADJUST instruction is used to set or clear the VMSA bit for
>> +	 * a page. A change to the VMSA bit is only performed when running
>> +	 * at VMPL0 and is ignored at other VMPL levels. If too low of a target
> 
> What does "too low" mean here exactly?
> 

I believe its saying that target VMPL is lesser than the current VMPL 
level. Now that we have VMPL0 check enforced in the beginning so will 
work on improving comment.

> The kernel is not at VMPL0 but the specified level is lower? Weird...
> 
>> +	 * VMPL level is specified, the instruction can succeed without changing
>> +	 * the VMSA bit should the kernel not be in VMPL0. Using a target VMPL
>> +	 * level of 1 will return a FAIL_PERMISSION error if the kernel is not
>> +	 * at VMPL0, thus ensuring that the VMSA bit has been properly set when
>> +	 * no error is returned.
> 
> We do check whether we run at VMPL0 earlier when starting the guest -
> see enforce_vmpl0().
> 
> I don't think you need any of that additional verification here - just
> assume you are at VMPL0.
> 

Yep.

>> +	 */
>> +	attrs = 1;
>> +	if (vmsa)
>> +		attrs |= RMPADJUST_VMSA_PAGE_BIT;
>> +
>> +	return rmpadjust((unsigned long)va, RMP_PG_SIZE_4K, attrs);
>> +}
>> +
>> +#define __ATTR_BASE		(SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK)
>> +#define INIT_CS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_READ_MASK | SVM_SELECTOR_CODE_MASK)
>> +#define INIT_DS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_WRITE_MASK)
>> +
>> +#define INIT_LDTR_ATTRIBS	(SVM_SELECTOR_P_MASK | 2)
>> +#define INIT_TR_ATTRIBS		(SVM_SELECTOR_P_MASK | 3)
>> +
>> +static void *snp_safe_alloc_page(void)
> 
> safe?
> 
> And you don't need to say "safe" - snp_alloc_vmsa_page() is perfectly fine.
> 

noted.

...

>> +
>> +	/*
>> +	 * A new VMSA is created each time because there is no guarantee that
>> +	 * the current VMSA is the kernels or that the vCPU is not running. If
> 
> kernel's.
> 
> And if it is not the kernel's, whose it is?

It could be hypervisor's VMSA.

> 
>> +	 * an attempt was done to use the current VMSA with a running vCPU, a
>> +	 * #VMEXIT of that vCPU would wipe out all of the settings being done
>> +	 * here.
> 
> I don't understand - this is waking up a CPU, how can it ever be a
> running vCPU which is using the current VMSA?!
> 
> There is per_cpu(snp_vmsa, cpu), who else can be using that one currently?
> 

Maybe Tom can expand it bit more?

...

>> +
>> +	if (!ghcb_sw_exit_info_1_is_valid(ghcb) ||
>> +	    lower_32_bits(ghcb->save.sw_exit_info_1)) {
>> +		pr_alert("SNP AP Creation error\n");
> 
> alert?

I see that smboot.c is using the pr_err() when failing to wakeup CPU; 
will switch to pr_err(), let me know if you don't agree with it.


thx
Tom Lendacky Jan. 12, 2022, 5:10 p.m. UTC | #6
On 1/12/22 10:33 AM, Brijesh Singh wrote:
> On 12/31/21 9:36 AM, Borislav Petkov wrote:
>> On Fri, Dec 10, 2021 at 09:43:12AM -0600, Brijesh Singh wrote:

>>> +     * an attempt was done to use the current VMSA with a running vCPU, a
>>> +     * #VMEXIT of that vCPU would wipe out all of the settings being done
>>> +     * here.
>>
>> I don't understand - this is waking up a CPU, how can it ever be a
>> running vCPU which is using the current VMSA?!

Yes, in general. My thought was that nothing is stopping a malicious 
hypervisor from performing a VMRUN on that vCPU and then the VMSA would be 
in use.

Thanks,
Tom

>>
>> There is per_cpu(snp_vmsa, cpu), who else can be using that one currently?
>>
> 
> Maybe Tom can expand it bit more?
>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 123a96f7dff2..38c14601ae4a 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -104,6 +104,7 @@  enum psc_op {
 	(((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
 
 #define GHCB_HV_FT_SNP			BIT_ULL(0)
+#define GHCB_HV_FT_SNP_AP_CREATION	(BIT_ULL(1) | GHCB_HV_FT_SNP)
 
 /* SNP Page State Change NAE event */
 #define VMGEXIT_PSC_MAX_ENTRY		253
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index f5d0569fd02b..f7cbd5164136 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -66,6 +66,8 @@  extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 
+#define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 extern struct static_key_false sev_es_enable_key;
 extern void __sev_es_ist_enter(struct pt_regs *regs);
@@ -130,6 +132,7 @@  void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
 void snp_set_memory_shared(unsigned long vaddr, unsigned int npages);
 void snp_set_memory_private(unsigned long vaddr, unsigned int npages);
+void snp_set_wakeup_secondary_cpu(void);
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
@@ -146,6 +149,7 @@  early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned i
 static inline void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op) { }
 static inline void snp_set_memory_shared(unsigned long vaddr, unsigned int npages) { }
 static inline void snp_set_memory_private(unsigned long vaddr, unsigned int npages) { }
+static inline void snp_set_wakeup_secondary_cpu(void) { }
 #endif
 
 #endif
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 0dcdb6e0c913..8b4c57baec52 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -109,6 +109,10 @@ 
 #define SVM_VMGEXIT_SET_AP_JUMP_TABLE		0
 #define SVM_VMGEXIT_GET_AP_JUMP_TABLE		1
 #define SVM_VMGEXIT_PSC				0x80000010
+#define SVM_VMGEXIT_AP_CREATION			0x80000013
+#define SVM_VMGEXIT_AP_CREATE_ON_INIT		0
+#define SVM_VMGEXIT_AP_CREATE			1
+#define SVM_VMGEXIT_AP_DESTROY			2
 #define SVM_VMGEXIT_HV_FEATURES			0x8000fffd
 #define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff
 
@@ -221,6 +225,7 @@ 
 	{ SVM_VMGEXIT_AP_HLT_LOOP,	"vmgexit_ap_hlt_loop" }, \
 	{ SVM_VMGEXIT_AP_JUMP_TABLE,	"vmgexit_ap_jump_table" }, \
 	{ SVM_VMGEXIT_PSC,	"vmgexit_page_state_change" }, \
+	{ SVM_VMGEXIT_AP_CREATION,	"vmgexit_ap_creation" }, \
 	{ SVM_VMGEXIT_HV_FEATURES,	"vmgexit_hypervisor_feature" }, \
 	{ SVM_EXIT_ERR,         "invalid_guest_state" }
 
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 35c772bf9f6c..21926b094378 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -18,6 +18,7 @@ 
 #include <linux/memblock.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/cpumask.h>
 
 #include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>
@@ -31,6 +32,7 @@ 
 #include <asm/svm.h>
 #include <asm/smp.h>
 #include <asm/cpu.h>
+#include <asm/apic.h>
 
 #define DR7_RESET_VALUE        0x400
 
@@ -91,6 +93,8 @@  struct ghcb_state {
 static DEFINE_PER_CPU(struct sev_es_runtime_data*, runtime_data);
 DEFINE_STATIC_KEY_FALSE(sev_es_enable_key);
 
+static DEFINE_PER_CPU(struct sev_es_save_area *, snp_vmsa);
+
 static __always_inline bool on_vc_stack(struct pt_regs *regs)
 {
 	unsigned long sp = regs->sp;
@@ -814,6 +818,231 @@  void snp_set_memory_private(unsigned long vaddr, unsigned int npages)
 	pvalidate_pages(vaddr, npages, 1);
 }
 
+static int snp_set_vmsa(void *va, bool vmsa)
+{
+	u64 attrs;
+
+	/*
+	 * The RMPADJUST instruction is used to set or clear the VMSA bit for
+	 * a page. A change to the VMSA bit is only performed when running
+	 * at VMPL0 and is ignored at other VMPL levels. If too low of a target
+	 * VMPL level is specified, the instruction can succeed without changing
+	 * the VMSA bit should the kernel not be in VMPL0. Using a target VMPL
+	 * level of 1 will return a FAIL_PERMISSION error if the kernel is not
+	 * at VMPL0, thus ensuring that the VMSA bit has been properly set when
+	 * no error is returned.
+	 */
+	attrs = 1;
+	if (vmsa)
+		attrs |= RMPADJUST_VMSA_PAGE_BIT;
+
+	return rmpadjust((unsigned long)va, RMP_PG_SIZE_4K, attrs);
+}
+
+#define __ATTR_BASE		(SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK)
+#define INIT_CS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_READ_MASK | SVM_SELECTOR_CODE_MASK)
+#define INIT_DS_ATTRIBS		(__ATTR_BASE | SVM_SELECTOR_WRITE_MASK)
+
+#define INIT_LDTR_ATTRIBS	(SVM_SELECTOR_P_MASK | 2)
+#define INIT_TR_ATTRIBS		(SVM_SELECTOR_P_MASK | 3)
+
+static void *snp_safe_alloc_page(void)
+{
+	unsigned long pfn;
+	struct page *p;
+
+	/*
+	 * Allocate an SNP safe page to workaround the SNP erratum where
+	 * the CPU will incorrectly signal an RMP violation  #PF if a
+	 * hugepage (2mb or 1gb) collides with the RMP entry of VMSA page.
+	 * The recommeded workaround is to not use the large page.
+	 *
+	 * Allocate one extra page, use a page which is not 2mb aligned
+	 * and free the other.
+	 */
+	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+	if (!p)
+		return NULL;
+
+	split_page(p, 1);
+
+	pfn = page_to_pfn(p);
+	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
+		pfn++;
+		__free_page(p);
+	} else {
+		__free_page(pfn_to_page(pfn + 1));
+	}
+
+	return page_address(pfn_to_page(pfn));
+}
+
+static int wakeup_cpu_via_vmgexit(int apic_id, unsigned long start_ip)
+{
+	struct sev_es_save_area *cur_vmsa, *vmsa;
+	struct ghcb_state state;
+	unsigned long flags;
+	struct ghcb *ghcb;
+	int cpu, err, ret;
+	u8 sipi_vector;
+	u64 cr4;
+
+	if ((sev_hv_features & GHCB_HV_FT_SNP_AP_CREATION) != GHCB_HV_FT_SNP_AP_CREATION)
+		return -EOPNOTSUPP;
+
+	/*
+	 * Verify the desired start IP against the known trampoline start IP
+	 * to catch any future new trampolines that may be introduced that
+	 * would require a new protected guest entry point.
+	 */
+	if (WARN_ONCE(start_ip != real_mode_header->trampoline_start,
+		      "Unsupported SEV-SNP start_ip: %lx\n", start_ip))
+		return -EINVAL;
+
+	/* Override start_ip with known protected guest start IP */
+	start_ip = real_mode_header->sev_es_trampoline_start;
+
+	/* Find the logical CPU for the APIC ID */
+	for_each_present_cpu(cpu) {
+		if (arch_match_cpu_phys_id(cpu, apic_id))
+			break;
+	}
+	if (cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	cur_vmsa = per_cpu(snp_vmsa, cpu);
+
+	/*
+	 * A new VMSA is created each time because there is no guarantee that
+	 * the current VMSA is the kernels or that the vCPU is not running. If
+	 * an attempt was done to use the current VMSA with a running vCPU, a
+	 * #VMEXIT of that vCPU would wipe out all of the settings being done
+	 * here.
+	 */
+	vmsa = (struct sev_es_save_area *)snp_safe_alloc_page();
+	if (!vmsa)
+		return -ENOMEM;
+
+	/* CR4 should maintain the MCE value */
+	cr4 = native_read_cr4() & X86_CR4_MCE;
+
+	/* Set the CS value based on the start_ip converted to a SIPI vector */
+	sipi_vector		= (start_ip >> 12);
+	vmsa->cs.base		= sipi_vector << 12;
+	vmsa->cs.limit		= 0xffff;
+	vmsa->cs.attrib		= INIT_CS_ATTRIBS;
+	vmsa->cs.selector	= sipi_vector << 8;
+
+	/* Set the RIP value based on start_ip */
+	vmsa->rip		= start_ip & 0xfff;
+
+	/* Set VMSA entries to the INIT values as documented in the APM */
+	vmsa->ds.limit		= 0xffff;
+	vmsa->ds.attrib		= INIT_DS_ATTRIBS;
+	vmsa->es		= vmsa->ds;
+	vmsa->fs		= vmsa->ds;
+	vmsa->gs		= vmsa->ds;
+	vmsa->ss		= vmsa->ds;
+
+	vmsa->gdtr.limit	= 0xffff;
+	vmsa->ldtr.limit	= 0xffff;
+	vmsa->ldtr.attrib	= INIT_LDTR_ATTRIBS;
+	vmsa->idtr.limit	= 0xffff;
+	vmsa->tr.limit		= 0xffff;
+	vmsa->tr.attrib		= INIT_TR_ATTRIBS;
+
+	vmsa->efer		= 0x1000;	/* Must set SVME bit */
+	vmsa->cr4		= cr4;
+	vmsa->cr0		= 0x60000010;
+	vmsa->dr7		= 0x400;
+	vmsa->dr6		= 0xffff0ff0;
+	vmsa->rflags		= 0x2;
+	vmsa->g_pat		= 0x0007040600070406ULL;
+	vmsa->xcr0		= 0x1;
+	vmsa->mxcsr		= 0x1f80;
+	vmsa->x87_ftw		= 0x5555;
+	vmsa->x87_fcw		= 0x0040;
+
+	/*
+	 * Set the SNP-specific fields for this VMSA:
+	 *   VMPL level
+	 *   SEV_FEATURES (matches the SEV STATUS MSR right shifted 2 bits)
+	 */
+	vmsa->vmpl		= 0;
+	vmsa->sev_features	= sev_status >> 2;
+
+	/* Switch the page over to a VMSA page now that it is initialized */
+	ret = snp_set_vmsa(vmsa, true);
+	if (ret) {
+		pr_err("set VMSA page failed (%u)\n", ret);
+		free_page((unsigned long)vmsa);
+
+		return -EINVAL;
+	}
+
+	/* Issue VMGEXIT AP Creation NAE event */
+	local_irq_save(flags);
+
+	ghcb = __sev_get_ghcb(&state);
+
+	vc_ghcb_invalidate(ghcb);
+	ghcb_set_rax(ghcb, vmsa->sev_features);
+	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_CREATION);
+	ghcb_set_sw_exit_info_1(ghcb, ((u64)apic_id << 32) | SVM_VMGEXIT_AP_CREATE);
+	ghcb_set_sw_exit_info_2(ghcb, __pa(vmsa));
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	if (!ghcb_sw_exit_info_1_is_valid(ghcb) ||
+	    lower_32_bits(ghcb->save.sw_exit_info_1)) {
+		pr_alert("SNP AP Creation error\n");
+		ret = -EINVAL;
+	}
+
+	__sev_put_ghcb(&state);
+
+	local_irq_restore(flags);
+
+	/* Perform cleanup if there was an error */
+	if (ret) {
+		err = snp_set_vmsa(vmsa, false);
+		if (err)
+			pr_err("clear VMSA page failed (%u), leaking page\n", err);
+		else
+			free_page((unsigned long)vmsa);
+
+		vmsa = NULL;
+	}
+
+	/* Free up any previous VMSA page */
+	if (cur_vmsa) {
+		err = snp_set_vmsa(cur_vmsa, false);
+		if (err)
+			pr_err("clear VMSA page failed (%u), leaking page\n", err);
+		else
+			free_page((unsigned long)cur_vmsa);
+	}
+
+	/* Record the current VMSA page */
+	per_cpu(snp_vmsa, cpu) = vmsa;
+
+	return ret;
+}
+
+void snp_set_wakeup_secondary_cpu(void)
+{
+	if (!cc_platform_has(CC_ATTR_SEV_SNP))
+		return;
+
+	/*
+	 * Always set this override if SEV-SNP is enabled. This makes it the
+	 * required method to start APs under SEV-SNP. If the hypervisor does
+	 * not support AP creation, then no APs will be started.
+	 */
+	apic->wakeup_secondary_cpu = wakeup_cpu_via_vmgexit;
+}
+
 int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
 {
 	u16 startup_cs, startup_ip;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ac2909f0cab3..9eca0b8a72e9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -82,6 +82,7 @@ 
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
 #include <asm/stackprotector.h>
+#include <asm/sev.h>
 
 #ifdef CONFIG_ACPI_CPPC_LIB
 #include <acpi/cppc_acpi.h>
@@ -1425,6 +1426,8 @@  void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	smp_quirk_init_udelay();
 
 	speculative_store_bypass_ht_init();
+
+	snp_set_wakeup_secondary_cpu();
 }
 
 void arch_thaw_secondary_cpus_begin(void)