
[RFC,Part2,04/30] x86/mm: split the physmap when adding the page in RMP table

Message ID 20210324170436.31843-5-brijesh.singh@amd.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Commit Message

Brijesh Singh March 24, 2021, 5:04 p.m. UTC
The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used in conjunction with standard x86 and IOMMU page
tables to enforce memory restrictions and page access rights. The
RMP is indexed by system physical address, and is checked at the end
of CPU and IOMMU table walks. The RMP check is enforced as soon as
SEV-SNP is enabled globally in the system. Not every memory access
requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality
is already protected via memory encryption. When hardware encounters
an RMP check failure, it raises a page-fault exception. The RMP bit in
the fault error code can be used to determine if the fault was due to
an RMP check failure.

A write from the hypervisor goes through the RMP checks. When the
hypervisor writes to a page, hardware checks that the assigned bit in
the RMP is zero (i.e., the page is shared). If the page table entry that
gives the sPA indicates that the target page size is a large page, then
all RMP entries for the 4KB pages constituting the target must have the
assigned bit clear. If any one of those entries has the assigned bit
set, hardware raises an RMP violation. To resolve it, we must split the
page table entry leading to the target page into 4K entries.
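
Expressed as pseudocode, the hardware check looks roughly like this
(rmp_entry() and its fields stand in for the hardware's RMP lookup;
this is an illustration, not kernel code):

static bool rmp_write_allowed(u64 spa, unsigned int level)
{
	u64 off;

	if (level == PG_LEVEL_4K)
		return !rmp_entry(spa)->assigned;

	/* Large page: every constituent 4K RMP entry must be shared. */
	for (off = 0; off < PMD_SIZE; off += PAGE_SIZE) {
		if (rmp_entry(spa + off)->assigned)
			return false;	/* hardware raises an RMP violation */
	}

	return true;
}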

This poses a challenge for the Linux memory model. The Linux kernel
creates a direct mapping of all physical memory -- referred to as the
physmap. The physmap may contain valid mappings of guest-owned pages.
During the page table walk, we may get into a situation where one of the
pages within the large page is owned by the guest (i.e., its assigned
bit is set in the RMP). A write to a non-guest page within that large
page will raise an RMP violation. To work around this, we call
set_memory_4k() to split the physmap before adding the page to the RMP
table. This ensures that pages added to the RMP table are mapped as 4K
in the physmap.

Splitting the physmap is a temporary solution until we improve the
kernel page fault handler to split the pages on demand. One disadvantage
of splitting is that, eventually, we will end up breaking down the
entire physmap unless we combine the split pages back into a large page.
I am open to suggestions on approaches we could take to address this
problem.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/mem_encrypt.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Dave Hansen March 25, 2021, 3:17 p.m. UTC | #1
On 3/24/21 10:04 AM, Brijesh Singh wrote:
> Splitting the physmap is a temporary solution until we improve the
> kernel page fault handler to split the pages on demand. One disadvantage
> of splitting is that, eventually, we will end up breaking down the
> entire physmap unless we combine the split pages back into a large page.
> I am open to suggestions on approaches we could take to address this
> problem.

Other than suggesting that the hardware be fixed to do the fracturing
itself?  :)

I suspect that this code is trying to be *too* generic.  I would expect
that very little of guest memory is actually shared with the host.  It's
also not going to be random guest pages.  The guest and the host have to
agree on these things, and I *think* the host is free to move the
physical backing around once it's shared.

So, let's say that there's a guest->host paravirt interface where the
guest says in advance, "I want to share this page."  The host can split
at *that* point and *only* split that one page's mapping.  Any page
faults would occur only if the host screws up, and would result in an oops.

That also gives a point where the host can say, "nope, that's hugetlbfs, I
can't split it".
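
(A rough sketch of what such a host-side hook could look like -- every
name below is hypothetical, this is not an interface from the series:)

/* Hypothetical handler for a guest "I want to share this page" request. */
static int snp_handle_share_request(struct kvm *kvm, gpa_t gpa)
{
	kvm_pfn_t pfn = gfn_to_pfn(kvm, gpa_to_gfn(gpa));

	if (is_error_noslot_pfn(pfn))
		return -EINVAL;

	/* The point where the host can refuse, e.g. a hugetlbfs backing. */
	if (PageHuge(pfn_to_page(pfn)))
		return -EINVAL;

	/* Split only this one page's physmap mapping, at share time. */
	return set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
}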
Borislav Petkov April 19, 2021, 12:32 p.m. UTC | #2
On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to a page, hardware checks that the assigned bit in
> the RMP is zero (i.e., the page is shared). If the page table entry that
> gives the sPA indicates that the target page size is a large page, then
> all RMP entries for the 4KB pages constituting the target must have the
> assigned bit clear.

Hmm, so this is important: I read this such that we can have a 2M
page table entry but the RMP table can contain 4K entries for the
corresponding 512 4K pages. Is that correct?

If so, then there's a certain discrepancy here and I'd expect that if
the page gets split/collapsed, depending on the result, the RMP table
should be updated too, so that it remains in sync.

For example:

* mm decides to group all 512 4K entries into a 2M entry, RMP table gets
updated in the end to reflect that

* mm decides to split a page, RMP table gets updated too, for the same
reason.

In this way, RMP table will be always in sync with the pagetables.
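
Illustratively, something like this (both hook points and helpers are
made up; whether psmash/RMPUPDATE are the right operations here depends
on the page states involved):

/* Hypothetical notifiers keeping the RMP in sync with the pagetables. */
void arch_kernel_mapping_split(u64 pfn)
{
	/* A 2M mapping was split: smash the 2M RMP entry into 512 4K ones. */
	psmash(pfn & ~0x1ffULL);
}

void arch_kernel_mapping_collapse(u64 pfn)
{
	/* 512 4K mappings were merged: reinstall a single 2M RMP entry. */
	rmpupdate_2m(pfn & ~0x1ffULL);
}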

I know, I probably am missing something, but that makes the most sense
to me, instead of noticing the discrepancy and dealing with it only
later, when handling the RMP violation.

Or?

Thx.
Brijesh Singh April 19, 2021, 3:25 p.m. UTC | #3
On 4/19/21 7:32 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
>> A write from the hypervisor goes through the RMP checks. When the
>> hypervisor writes to a page, hardware checks that the assigned bit in
>> the RMP is zero (i.e., the page is shared). If the page table entry that
>> gives the sPA indicates that the target page size is a large page, then
>> all RMP entries for the 4KB pages constituting the target must have the
>> assigned bit clear.
> Hmm, so this is important: I read this such that we can have a 2M
> page table entry but the RMP table can contain 4K entries for the
> corresponding 512 4K pages. Is that correct?

Yes, that is correct.


>
> If so, then there's a certain discrepancy here and I'd expect that if
> the page gets split/collapsed, depending on the result, the RMP table
> should be updated too, so that it remains in sync.

Yes, that is correct. For a write access to succeed, we need both the
x86 and RMP page tables to be in sync.

>
> For example:
>
> * mm decides to group all 512 4K entries into a 2M entry, RMP table gets
> updated in the end to reflect that

To my understanding, we don't group 512 4K entries into a 2M page for
the kernel address range. We do this for userspace addresses through the
khugepaged daemon. If the page tables get out of sync then it will cause
an RMP violation; Patch #7 adds support to split the pages on demand.


>
> * mm decides to split a page, RMP table gets updated too, for the same
> reason.
>
> In this way, RMP table will be always in sync with the pagetables.
>
> I know, I probably am missing something but that makes most sense to
> me instead of noticing the discrepancy and getting to work then, when
> handling the RMP violation.
>
> Or?
>
> Thx.
>
Borislav Petkov April 19, 2021, 4:52 p.m. UTC | #4
On Mon, Apr 19, 2021 at 10:25:01AM -0500, Brijesh Singh wrote:
> To my understanding, we don't group 512 4K entries into a 2M page for
> the kernel address range. We do this for userspace addresses through the
> khugepaged daemon. If the page tables get out of sync then it will cause
> an RMP violation; Patch #7 adds support to split the pages on demand.

Ok. So I haven't reviewed the whole thing, but is it possible to keep
the RMP table in sync so that you don't have to split the physmap like
you do in this patch?

I.e., if the physmap page is 2M, then you have a corresponding RMP entry
of 2M so that you don't have to split. And if you have 4K, then the
corresponding RMP entry is 4K. You get the idea...

IOW, when does that happen: "During the page table walk, we may get into
a situation where one of the pages within the large page is owned by
the guest (i.e., its assigned bit is set in the RMP)." In which case is a
4K page - as part of a 2M physmap mapping - owned by a guest?

Thx.
Dave Hansen April 19, 2021, 5:58 p.m. UTC | #5
On 4/19/21 10:46 AM, Brijesh Singh wrote:
> - guest wants to make gpa 0x1000 a shared page. To support this, we
> need to psmash the large RMP entry into 512 4K entries. The psmash
> instruction breaks the large RMP entry into 512 4K entries without
> affecting the previous validation. Now we need to force the host to
> use the 4K page level instead of the 2MB one.
> 
> To my understanding, the Linux kernel fault handler does not build the
> page tables on demand for kernel addresses. All kernel addresses are
> pre-mapped at boot. Currently, I am proactively splitting the physmap
> to avoid running into a situation where the x86 page level is greater
> than the RMP page level.

In other words, if the host maps guest memory with 2M mappings, the
guest can induce page faults in the host.  The only way the host can
avoid this is to map everything with 4k mappings.

If the host does not avoid this, it could end up in the situation where
it gets page faults on access to kernel data structures.  Imagine if a
kernel stack page ended up in the same 2M mapping as a guest page.  I
*think* the next write to the kernel stack would end up double-faulting.
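
(For reference, a rough sketch of the host-side sequence quoted above;
psmash()'s signature and the helper names are assumptions, and the
RMPUPDATE that actually clears the assigned bit is omitted:)

static int snp_make_page_shared(kvm_pfn_t pfn)
{
	int ret;

	/* PSMASH: split the 2M RMP entry into 512 4K entries without
	 * affecting the guest's prior validation. */
	ret = psmash(pfn & ~0x1ffULL);
	if (ret)
		return ret;

	/* Force the host physmap down to 4K for this page as well. */
	return set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
}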
Andy Lutomirski April 19, 2021, 6:10 p.m. UTC | #6
> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>> need to psmash the large RMP entry into 512 4K entries. The psmash
>> instruction breaks the large RMP entry into 512 4K entries without
>> affecting the previous validation. Now we need to force the host to
>> use the 4K page level instead of the 2MB one.
>> 
>> To my understanding, the Linux kernel fault handler does not build the
>> page tables on demand for kernel addresses. All kernel addresses are
>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>> to avoid running into a situation where the x86 page level is greater
>> than the RMP page level.
> 
> In other words, if the host maps guest memory with 2M mappings, the
> guest can induce page faults in the host.  The only way the host can
> avoid this is to map everything with 4k mappings.
> 
> If the host does not avoid this, it could end up in the situation where
> it gets page faults on access to kernel data structures.  Imagine if a
> kernel stack page ended up in the same 2M mapping as a guest page.  I
> *think* the next write to the kernel stack would end up double-faulting.

I’m confused by this scenario. This should only affect physical pages that are in the 2M area that contains guest memory. But, if we have a 2M direct map PMD entry that contains kernel data and guest private memory, we’re already in a situation in which the kernel touching that memory would machine check, right?

ISTM we should fully unmap any guest private page from the kernel and all host user pagetables before actually making it be a guest private page.
Dave Hansen April 19, 2021, 6:33 p.m. UTC | #7
On 4/19/21 11:10 AM, Andy Lutomirski wrote:
> I’m confused by this scenario. This should only affect physical pages
> that are in the 2M area that contains guest memory. But, if we have a
> 2M direct map PMD entry that contains kernel data and guest private
> memory, we’re already in a situation in which the kernel touching
> that memory would machine check, right?

Not machine check, but page fault.  Do machine checks even play a
special role in SEV-SNP?  I thought that was only TDX?

My point was just that you can't _easily_ do the 2M->4k kernel mapping
demotion in a page fault handler, like I think Borislav was suggesting.

> ISTM we should fully unmap any guest private page from the kernel and
> all host user pagetables before actually making it be a guest private
> page.

Yes, that sounds attractive.  Then, we'd actually know if the host
kernel was doing stray reads somehow because we'd get a fault there too.
Andy Lutomirski April 19, 2021, 6:37 p.m. UTC | #8
> On Apr 19, 2021, at 11:33 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/19/21 11:10 AM, Andy Lutomirski wrote:
>> I’m confused by this scenario. This should only affect physical pages
>> that are in the 2M area that contains guest memory. But, if we have a
>> 2M direct map PMD entry that contains kernel data and guest private
>> memory, we’re already in a situation in which the kernel touching
>> that memory would machine check, right?
> 
> Not machine check, but page fault.  Do machine checks even play a
> special role in SEV-SNP?  I thought that was only TDX?

Brain fart.

> 
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

We are certainly toast if this hits the stack.  Or if it hits a page table or the GDT or IDT :). The latter delightful choices would be triple faults.

I sure hope the code we use to split a mapping is properly NMI safe.

> 
>> ISTM we should fully unmap any guest private page from the kernel and
>> all host user pagetables before actually making it be a guest private
>> page.
> 
> Yes, that sounds attractive.  Then, we'd actually know if the host
> kernel was doing stray reads somehow because we'd get a fault there too.
Brijesh Singh April 19, 2021, 9:25 p.m. UTC | #9
On 4/19/21 1:10 PM, Andy Lutomirski wrote:
>
>> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>>> need to psmash the large RMP entry into 512 4K entries. The psmash
>>> instruction breaks the large RMP entry into 512 4K entries without
>>> affecting the previous validation. Now we need to force the host to
>>> use the 4K page level instead of the 2MB one.
>>>
>>> To my understanding, the Linux kernel fault handler does not build the
>>> page tables on demand for kernel addresses. All kernel addresses are
>>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>>> to avoid running into a situation where the x86 page level is greater
>>> than the RMP page level.
>> In other words, if the host maps guest memory with 2M mappings, the
>> guest can induce page faults in the host.  The only way the host can
>> avoid this is to map everything with 4k mappings.
>>
>> If the host does not avoid this, it could end up in the situation where
>> it gets page faults on access to kernel data structures.  Imagine if a
>> kernel stack page ended up in the same 2M mapping as a guest page.  I
>> *think* the next write to the kernel stack would end up double-faulting.
> I’m confused by this scenario. This should only affect physical pages that are in the 2M area that contains guest memory. But, if we have a 2M direct map PMD entry that contains kernel data and guest private memory, we’re already in a situation in which the kernel touching that memory would machine check, right?

When SEV-SNP is enabled in the host, a page can be in one of the
following states:

1. Hypervisor (assigned=0, validated=0)

2. Firmware (assigned=1, immutable=1)

3. Context/VMSA (assigned=1, vmsa=1)

4. Guest private (assigned=1, validated=1)

You are right that we should never run into a situation where kernel
data and a guest page end up in the same PMD entry.

During SEV VM creation, KVM allocates one firmware page and one VMSA
page for each vcpu. The firmware page is used by the SEV-SNP firmware
to keep some private metadata. The VMSA page contains the guest register
state. I am more concerned about the pages allocated by KVM for the
VMSA and firmware. These pages are not guest private per se. To avoid
getting into this situation, we could probably create an SNP buffer
pool; all the firmware and VMSA pages would come from this pool.

Another challenging case: KVM maps a guest page and writes to it. One
such example is the GHCB page. If the mapped address is covered by a
large PMD mapping, then we will get an RMP violation.


> ISTM we should fully unmap any guest private page from the kernel and all host user pagetables before actually making it be a guest private page.
Borislav Petkov April 20, 2021, 9:47 a.m. UTC | #10
On Mon, Apr 19, 2021 at 12:46:53PM -0500, Brijesh Singh wrote:
> - KVM calls alloc_page() to allocate a VMSA page. The allocator returns
> 0xffffc80000200000 (PFN 0x200, page-level=2M). The VMSA page is a private
> page, so KVM will call RMPUPDATE to add it as a private page in the RMP
> table. While adding the RMP entry, KVM will use page level=4K.

Right, and *here* we split the 2M page on the *host* so that there's no
discrepancy between the host pagetable and the RMP. I guess your patch
does exactly that. :)
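
(I.e., roughly this flow; the rmpupdate field names below are simplified
assumptions, only rmptable_rmpupdate() and set_memory_4k() come from this
patch:)

static int snp_alloc_vmsa_page(struct page **out)
{
	struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
	struct rmpupdate val = {};
	int ret;

	if (!page)
		return -ENOMEM;

	val.assigned = 1;	/* the VMSA page becomes private */
	val.pagesize = 0;	/* RMP page level 4K */

	/* rmptable_rmpupdate() (this patch) calls set_memory_4k() first,
	 * so the physmap never maps this page with a 2M entry. */
	ret = rmptable_rmpupdate(page, &val);
	if (ret) {
		__free_page(page);
		return ret;
	}

	*out = page;
	return 0;
}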

And AFAIR, set_memory.c doesn't have the functionality to coalesce 4K
pages back into the corresponding 2M page. Which means the RMP table is
in sync, more or less.

Thx and thanks for elaborating.
Borislav Petkov April 20, 2021, 9:51 a.m. UTC | #11
On Mon, Apr 19, 2021 at 11:33:08AM -0700, Dave Hansen wrote:
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

Yeah, see my reply to Brijesh. Not in the #PF handler, but when the RMP
table gets updated on page allocation, we should split the kernel
mapping too, so that it corresponds to what's being changed in the RMP
table.

Dunno how useful it would be if we also do coalescing of 4K pages into
their corresponding 2M pages... I haven't looked at set_memory.c for a
long time and have forgotten about all details...

In any case, my main goal here is to keep the tables in sync so that we
don't have to do crazy splitting in unsafe contexts like #PF. I hope I'm
making sense...

Patch

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 7a0138cb3e17..4047acb37c30 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -674,6 +674,12 @@  int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)
 	if (!static_branch_unlikely(&snp_enable_key))
 		return -ENXIO;
 
+	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
+	if (ret) {
+		pr_err("SEV-SNP: failed to split physical address 0x%lx (%d)\n", spa, ret);
+		return ret;
+	}
+
 	/* Retry if another processor is modifying the RMP entry. */
 	do {
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"