Message ID | 20210324170436.31843-5-brijesh.singh@amd.com (mailing list archive)
State | Not Applicable
Delegated to: | Herbert Xu
Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
On 3/24/21 10:04 AM, Brijesh Singh wrote:
> The splitting of the physmap is a temporary solution until we work to
> improve the kernel page fault handler to split the pages on demand.
> One of the disadvantages of splitting is that eventually we will end up
> breaking down the entire physmap unless we combine the split pages back
> into a large page. I am open to suggestions on various approaches we
> could take to address this problem.

Other than suggesting that the hardware be fixed to do the fracturing
itself? :)

I suspect that this code is trying to be *too* generic. I would expect
that very little of guest memory is actually shared with the host. It's
also not going to be random guest pages. The guest and the host have to
agree on these things, and I *think* the host is free to move the
physical backing around once it's shared.

So, let's say that there is a guest->host paravirt interface where the
guest says in advance, "I want to share this page." The host can split
at *that* point and *only* split that one page's mapping. Any page
faults would occur only if the host screws up, and would result in an
oops. That also gives a point where the host can say, "nope, that's
hugetlbfs, I can't split it".
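A minimal sketch of that idea on the host side, assuming a hypothetical
page-state-change handler invoked when the guest asks to convert a page;
the handler name and the rmp_make_shared()/rmp_make_private() wrappers
are illustrative, not the series' actual API:

#include <linux/kvm_host.h>
#include <asm/page.h>
#include <asm/set_memory.h>

/*
 * Sketch only: split the direct-map mapping of just the page the guest
 * asked to convert, at the point the request arrives, instead of
 * pre-splitting the whole physmap.  rmp_make_shared() and
 * rmp_make_private() are hypothetical wrappers around RMPUPDATE.
 */
static int snp_handle_page_state_change(struct kvm *kvm, gpa_t gpa, bool to_shared)
{
        kvm_pfn_t pfn = gfn_to_pfn(kvm, gpa_to_gfn(gpa));
        int ret;

        if (is_error_noslot_pfn(pfn))
                return -EINVAL;

        /* Demote only this page's kernel mapping to 4K. */
        ret = set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
        if (ret)
                return ret;

        return to_shared ? rmp_make_shared(pfn) : rmp_make_private(pfn);
}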
On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to pages, hardware checks to ensure that the assigned
> bit in the RMP is zero (i.e., the page is shared). If the page table
> entry that gives the sPA indicates that the target page size is a large
> page, then all RMP entries for the 4KB pages constituting the target
> must have the assigned bit 0.

Hmm, so this is important: I read this such that we can have a 2M page
table entry but the RMP table can contain 4K entries for the
corresponding 512 4K pages. Is that correct?

If so, then there's a certain discrepancy here and I'd expect that if
the page gets split/collapsed, depending on the result, the RMP table
should be updated too, so that it remains in sync.

For example:

* mm decides to group all 512 4K entries into a 2M entry; the RMP table
  gets updated in the end to reflect that
* mm decides to split a page; the RMP table gets updated too, for the
  same reason.

In this way, the RMP table will always be in sync with the pagetables.

I know, I'm probably missing something, but that makes more sense to me
than noticing the discrepancy and getting to work only then, when
handling the RMP violation.

Or?

Thx.
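For concreteness, an illustrative model (not real code) of the hardware
write check being discussed here; rmp_entry_assigned() is a hypothetical
lookup helper, not something the series defines:

#include <linux/types.h>
#include <asm/pgtable.h>

/*
 * Illustrative model of the RMP check for a hypervisor write through a
 * 2M page table entry: every one of the 512 covered 4K RMP entries must
 * be unassigned (shared), otherwise the write faults with the RMP bit
 * set in the page-fault error code.
 */
static bool rmp_allows_2m_host_write(u64 pfn_2m_aligned)
{
        unsigned int i;

        for (i = 0; i < PTRS_PER_PMD; i++) {
                if (rmp_entry_assigned(pfn_2m_aligned + i))
                        return false;   /* would raise an RMP violation */
        }

        return true;
}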
On 4/19/21 7:32 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
>> A write from the hypervisor goes through the RMP checks. When the
>> hypervisor writes to pages, hardware checks to ensure that the assigned
>> bit in the RMP is zero (i.e., the page is shared). If the page table
>> entry that gives the sPA indicates that the target page size is a large
>> page, then all RMP entries for the 4KB pages constituting the target
>> must have the assigned bit 0.
>
> Hmm, so this is important: I read this such that we can have a 2M
> page table entry but the RMP table can contain 4K entries for the
> corresponding 512 4K pages. Is that correct?

Yes, that is correct.

> If so, then there's a certain discrepancy here and I'd expect that if
> the page gets split/collapsed, depending on the result, the RMP table
> should be updated too, so that it remains in sync.

Yes, that is correct. For a write access to succeed, we need both the
x86 and the RMP page tables in sync.

> For example:
>
> * mm decides to group all 512 4K entries into a 2M entry; the RMP table
>   gets updated in the end to reflect that

To my understanding, we don't group 512 4K entries into a 2M entry for
the kernel address range. We do this for userspace addresses through the
khugepaged daemon. If the page tables get out of sync, it will cause an
RMP violation; Patch #7 adds support to split the pages on demand.

> * mm decides to split a page; the RMP table gets updated too, for the
>   same reason.
>
> In this way, the RMP table will always be in sync with the pagetables.
>
> I know, I'm probably missing something, but that makes more sense to me
> than noticing the discrepancy and getting to work only then, when
> handling the RMP violation.
>
> Or?
>
> Thx.
On Mon, Apr 19, 2021 at 10:25:01AM -0500, Brijesh Singh wrote:
> To my understanding, we don't group 512 4K entries into a 2M entry for
> the kernel address range. We do this for userspace addresses through the
> khugepaged daemon. If the page tables get out of sync, it will cause an
> RMP violation; Patch #7 adds support to split the pages on demand.

Ok. So I haven't reviewed the whole thing but, is it possible to keep
the RMP table in sync so that you don't have to split the physmap like
you do in this patch?

I.e., if the physmap page is 2M, then you have a corresponding RMP
entry of 2M so that you don't have to split. And if you have 4K, then
the corresponding RMP entry is 4K. You get the idea...

IOW, when does this happen:

"During the page table walk, we may get into the situation where one of
the pages within the large page is owned by the guest (i.e., the
assigned bit is set in the RMP)."

In which case is a 4K page - as part of a 2M physmap mapping - owned by
a guest?

Thx.
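A minimal sketch of keeping the two in sync by matching the RMP entry
size to the current direct-map level; rmp_make_private() and the
RMP_PG_SIZE_* constants are assumptions for illustration, not the
series' API:

#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/pgtable_types.h>

/*
 * Sketch only: pick the RMP entry size from the current direct-map
 * mapping level so the two never disagree.
 */
static int rmp_make_private_matching(u64 pfn)
{
        unsigned int level;
        pte_t *pte;

        pte = lookup_address((unsigned long)pfn_to_kaddr(pfn), &level);
        if (!pte || pte_none(*pte))
                return -EFAULT;

        return rmp_make_private(pfn, level == PG_LEVEL_2M ?
                                     RMP_PG_SIZE_2M : RMP_PG_SIZE_4K);
}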
On 4/19/21 10:46 AM, Brijesh Singh wrote:
> - guest wants to make gpa 0x1000 a shared page. To support this, we
> need to psmash the large RMP entry into 512 4K entries. The psmash
> instruction breaks the large RMP entry into 512 4K entries without
> affecting the previous validation. Now we need to force the host to
> use the 4K page level instead of the 2MB one.
>
> To my understanding, the Linux kernel fault handler does not build the
> page tables on demand for kernel addresses. All kernel addresses are
> pre-mapped at boot. Currently, I am proactively splitting the physmap
> to avoid running into the situation where the x86 page level is greater
> than the RMP page level.

In other words, if the host maps guest memory with 2M mappings, the
guest can induce page faults in the host. The only way the host can
avoid this is to map everything with 4k mappings.

If the host does not avoid this, it could end up in the situation where
it gets page faults on access to kernel data structures. Imagine if a
kernel stack page ended up in the same 2M mapping as a guest page. I
*think* the next write to the kernel stack would end up double-faulting.
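A sketch of the sequence described in the quoted text, assuming
hypothetical psmash() and rmp_make_shared() wrappers around the PSMASH
and RMPUPDATE instructions (the series' real helpers may differ):

#include <linux/kernel.h>
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/set_memory.h>

/*
 * Sketch only: the guest asks to share one 4K page that currently
 * lives inside a 2M private RMP entry.
 */
static int snp_share_4k_within_2m(u64 pfn)
{
        int ret;

        /* PSMASH: break the 2M RMP entry into 512 4K entries. */
        ret = psmash(ALIGN_DOWN(pfn, PTRS_PER_PMD));
        if (ret)
                return ret;

        /* The host mapping must not be larger than the RMP entries. */
        ret = set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
        if (ret)
                return ret;

        /* Finally flip just this one page to shared in the RMP. */
        return rmp_make_shared(pfn);
}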
> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>> need to psmash the large RMP entry into 512 4K entries. The psmash
>> instruction breaks the large RMP entry into 512 4K entries without
>> affecting the previous validation. Now we need to force the host to
>> use the 4K page level instead of the 2MB one.
>>
>> To my understanding, the Linux kernel fault handler does not build the
>> page tables on demand for kernel addresses. All kernel addresses are
>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>> to avoid running into the situation where the x86 page level is greater
>> than the RMP page level.
>
> In other words, if the host maps guest memory with 2M mappings, the
> guest can induce page faults in the host. The only way the host can
> avoid this is to map everything with 4k mappings.
>
> If the host does not avoid this, it could end up in the situation where
> it gets page faults on access to kernel data structures. Imagine if a
> kernel stack page ended up in the same 2M mapping as a guest page. I
> *think* the next write to the kernel stack would end up double-faulting.

I’m confused by this scenario. This should only affect physical pages
that are in the 2M area that contains guest memory. But, if we have a
2M direct map PMD entry that contains kernel data and guest private
memory, we’re already in a situation in which the kernel touching that
memory would machine check, right?

ISTM we should fully unmap any guest private page from the kernel and
all host user pagetables before actually making it be a guest private
page.
On 4/19/21 11:10 AM, Andy Lutomirski wrote:
> I’m confused by this scenario. This should only affect physical pages
> that are in the 2M area that contains guest memory. But, if we have a
> 2M direct map PMD entry that contains kernel data and guest private
> memory, we’re already in a situation in which the kernel touching
> that memory would machine check, right?

Not machine check, but page fault. Do machine checks even play a
special role in SEV-SNP? I thought that was only TDX?

My point was just that you can't _easily_ do the 2M->4k kernel mapping
demotion in a page fault handler, like I think Borislav was suggesting.

> ISTM we should fully unmap any guest private page from the kernel and
> all host user pagetables before actually making it be a guest private
> page.

Yes, that sounds attractive. Then we'd actually know if the host kernel
was doing stray reads somehow, because we'd get a fault there too.
> On Apr 19, 2021, at 11:33 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/19/21 11:10 AM, Andy Lutomirski wrote:
>> I’m confused by this scenario. This should only affect physical pages
>> that are in the 2M area that contains guest memory. But, if we have a
>> 2M direct map PMD entry that contains kernel data and guest private
>> memory, we’re already in a situation in which the kernel touching
>> that memory would machine check, right?
>
> Not machine check, but page fault. Do machine checks even play a
> special role in SEV-SNP? I thought that was only TDX?

Brain fart.

> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

We are certainly toast if this hits the stack. Or if it hits a page
table or the GDT or IDT :). The latter delightful choices would be
triple faults.

I sure hope the code we use to split a mapping is properly NMI safe.

>> ISTM we should fully unmap any guest private page from the kernel and
>> all host user pagetables before actually making it be a guest private
>> page.
>
> Yes, that sounds attractive. Then we'd actually know if the host kernel
> was doing stray reads somehow, because we'd get a fault there too.
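A minimal sketch of the "fully unmap" idea using the existing
set_direct_map_*() helpers; whether these are suitable as-is for SNP
private pages is an assumption here, and the function names are
illustrative:

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/*
 * Sketch only: drop a page from the kernel direct map before handing
 * it to the guest as private, so any stray host access faults cleanly
 * instead of tripping an RMP check.  The *_noflush helpers leave TLB
 * invalidation to the caller.
 */
static int snp_unmap_from_directmap(struct page *page)
{
        unsigned long addr = (unsigned long)page_address(page);
        int ret;

        /* Splits the covering 2M mapping if needed, then clears the PTE. */
        ret = set_direct_map_invalid_noflush(page);
        if (ret)
                return ret;

        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return 0;
}

/* Restore the default direct-map mapping when the page is reclaimed. */
static int snp_remap_to_directmap(struct page *page)
{
        return set_direct_map_default_noflush(page);
}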
On 4/19/21 1:10 PM, Andy Lutomirski wrote:
>
>> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>>> need to psmash the large RMP entry into 512 4K entries. The psmash
>>> instruction breaks the large RMP entry into 512 4K entries without
>>> affecting the previous validation. Now we need to force the host to
>>> use the 4K page level instead of the 2MB one.
>>>
>>> To my understanding, the Linux kernel fault handler does not build the
>>> page tables on demand for kernel addresses. All kernel addresses are
>>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>>> to avoid running into the situation where the x86 page level is greater
>>> than the RMP page level.
>>
>> In other words, if the host maps guest memory with 2M mappings, the
>> guest can induce page faults in the host. The only way the host can
>> avoid this is to map everything with 4k mappings.
>>
>> If the host does not avoid this, it could end up in the situation where
>> it gets page faults on access to kernel data structures. Imagine if a
>> kernel stack page ended up in the same 2M mapping as a guest page. I
>> *think* the next write to the kernel stack would end up double-faulting.
>
> I’m confused by this scenario. This should only affect physical pages
> that are in the 2M area that contains guest memory. But, if we have a
> 2M direct map PMD entry that contains kernel data and guest private
> memory, we’re already in a situation in which the kernel touching that
> memory would machine check, right?

When SEV-SNP is enabled in the host, a page can be in one of the
following states:

1. Hypervisor (assigned=0, validated=0)
2. Firmware (assigned=1, immutable=1)
3. Context/VMSA (assigned=1, vmsa=1)
4. Guest private (assigned=1, validated=1)

You are right that we should never run into the situation where kernel
data and a guest page end up in the same PMD entry.

During SEV-VM creation, KVM allocates one firmware page and one VMSA
page for each vcpu. The firmware page is used by the SEV-SNP firmware to
keep some private metadata. The VMSA page contains the guest register
state.

I am more concerned about the pages allocated by KVM for the VMSA and
firmware. These pages are not guest private per se. To avoid getting
into this situation, we can probably create an SNP buffer pool; all the
firmware and VMSA pages should come from this pool.

Another challenging one: KVM maps a guest page and writes to it. One
such example is the GHCB page. If the mapped address is covered by a
large PMD entry, then we will get an RMP violation.

> ISTM we should fully unmap any guest private page from the kernel and
> all host user pagetables before actually making it be a guest private
> page.
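A minimal sketch of what such a pool's allocation path might look like,
assuming it simply pre-splits the direct map for every page it hands
out; snp_safe_alloc_page() is an illustrative name and a real pool
would batch and recycle these pages:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

/*
 * Sketch only: allocate a page destined to become firmware/VMSA
 * backing and pre-split its direct-map mapping to 4K, so a later
 * RMPUPDATE never leaves unrelated kernel data behind a mapping that
 * is larger than its RMP entries.
 */
static struct page *snp_safe_alloc_page(void)
{
        struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

        if (!page)
                return NULL;

        if (set_memory_4k((unsigned long)page_address(page), 1)) {
                __free_page(page);
                return NULL;
        }

        return page;
}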
On Mon, Apr 19, 2021 at 12:46:53PM -0500, Brijesh Singh wrote:
> - KVM calls alloc_page() to allocate a VMSA page. The allocator returns
> 0xffffc80000200000 (PFN 0x200, page-level=2M). The VMSA page is a private
> page, so KVM will call RMPUPDATE to add the page as a private page in the
> RMP table. While adding the RMP entry, KVM will use page level=4K.

Right, and *here* we split the 2M page on the *host* so that there's no
discrepancy between the host pagetable and the RMP. I guess your patch
does exactly that. :)

And AFAIR, set_memory.c doesn't have the functionality to coalesce 4K
pages back into the corresponding 2M page. Which means the RMP table is
in sync, more or less.

Thx and thanks for elaborating.
On Mon, Apr 19, 2021 at 11:33:08AM -0700, Dave Hansen wrote:
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

Yeah, see my reply to Brijesh. Not in the #PF handler, but when the
host updates the RMP table on page allocation, we should split the
kernel mapping too, so that it corresponds to what's being changed in
the RMP table.

Dunno how useful it would be if we also did coalescing of 4K pages into
their corresponding 2M pages... I haven't looked at set_memory.c for a
long time and have forgotten all the details...

In any case, my main goal here is to keep the tables in sync so that we
don't have to do crazy splitting in unsafe contexts like #PF.

I hope I'm making sense...
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 7a0138cb3e17..4047acb37c30 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -674,6 +674,12 @@ int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)
 	if (!static_branch_unlikely(&snp_enable_key))
 		return -ENXIO;
 
+	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
+	if (ret) {
+		pr_err("SEV-SNP: failed to split physical address 0x%lx (%d)\n", spa, ret);
+		return ret;
+	}
+
 	/* Retry if another processor is modifying the RMP entry. */
 	do {
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used in conjunction with standard x86 and IOMMU page tables
to enforce memory restrictions and page access rights. The RMP is
indexed by system physical address, and is checked at the end of CPU
and IOMMU table walks. The RMP check is enforced as soon as SEV-SNP is
enabled globally in the system.

Not every memory access requires an RMP check. In particular, read
accesses from the hypervisor do not require RMP checks because data
confidentiality is already protected via memory encryption. When
hardware encounters an RMP check failure, it raises a page-fault
exception. The RMP bit in the fault error code can be used to determine
if the fault was due to an RMP check failure.

A write from the hypervisor goes through the RMP checks. When the
hypervisor writes to pages, hardware checks to ensure that the assigned
bit in the RMP is zero (i.e., the page is shared). If the page table
entry that gives the sPA indicates that the target page size is a large
page, then all RMP entries for the 4KB pages constituting the target
must have the assigned bit 0. If any of those entries does not have the
assigned bit 0, the hardware will raise an RMP violation. To resolve it,
we must split the page table entry leading to the target page into 4K.

This poses a challenge in the Linux memory model. The Linux kernel
creates a direct mapping of all the physical memory -- referred to as
the physmap. The physmap may contain a valid mapping of guest owned
pages. During the page table walk, we may get into the situation where
one of the pages within the large page is owned by the guest (i.e., the
assigned bit is set in the RMP). A write to a non-guest page within the
large page will raise an RMP violation. To work around it, we call
set_memory_4k() to split the physmap before adding the page to the RMP
table. This ensures that the pages added to the RMP table are mapped as
4K in the physmap.

The splitting of the physmap is a temporary solution until we work to
improve the kernel page fault handler to split the pages on demand.
One of the disadvantages of splitting is that eventually we will end up
breaking down the entire physmap unless we combine the split pages back
into a large page. I am open to suggestions on various approaches we
could take to address this problem.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/mem_encrypt.c | 6 ++++++
 1 file changed, 6 insertions(+)