| Message ID | 20201020061859.18385-16-kirill.shutemov@linux.intel.com |
|---|---|
| State | New, archived |
| Series | KVM protected memory extension |
On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
>
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.

How isn't disabling migration a problem? Are we going to back allocation
by CMA instead?
On 20.10.20 08:18, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.

Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
As if all of the encrypted VM implementations didn't bring us enough
ugliness already (SEV extensions also don't support reboots, but can at
least kexec() IIRC).

Something similar is done with secretmem [1]. And people don't seem to
like fragmenting the direct mapping (including me).

[1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
On 20.10.20 14:18, David Hildenbrand wrote:
> On 20.10.20 08:18, Kirill A. Shutemov wrote:
>> If the protected memory feature enabled, unmap guest memory from
>> kernel's direct mappings.
>
> Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
> As if all of the encrypted VM implementations didn't bring us enough
> ugliness already (SEV extensions also don't support reboots, but can at
> least kexec() IIRC).
>
> Something similar is done with secretmem [1]. And people don't seem to
> like fragmenting the direct mapping (including me).
>
> [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org

I just thought "hey, we might have to replace pud/pmd mappings by page
tables when calling kernel_map_pages()"; this can fail with -ENOMEM, so
why isn't there proper error handling?

Then I dived into __kernel_map_pages(), which states:

"The return value is ignored as the calls cannot fail. Large pages for
identity mappings are not used at boot time and hence no memory
allocations during large page split."

I am probably missing something important, but how is calling
kernel_map_pages() safe *after* booting?! I know we use it for
debug_pagealloc(), but using it in a production-ready feature feels
completely irresponsible. What am I missing?
On Tue, 2020-10-20 at 15:20 +0200, David Hildenbrand wrote:
> On 20.10.20 14:18, David Hildenbrand wrote:
> > On 20.10.20 08:18, Kirill A. Shutemov wrote:
> > > If the protected memory feature enabled, unmap guest memory from
> > > kernel's direct mappings.
> >
> > Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
> > As if all of the encrypted VM implementations didn't bring us enough
> > ugliness already (SEV extensions also don't support reboots, but can at
> > least kexec() IIRC).
> >
> > Something similar is done with secretmem [1]. And people don't seem to
> > like fragmenting the direct mapping (including me).
> >
> > [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
>
> I just thought "hey, we might have to replace pud/pmd mappings by page
> tables when calling kernel_map_pages", this can fail with -ENOMEM, why
> isn't there proper error handling.
>
> Then I dived into __kernel_map_pages() which states:
>
> "The return value is ignored as the calls cannot fail. Large pages for
> identity mappings are not used at boot time and hence no memory
> allocations during large page split."
>
> I am probably missing something important, but how is calling
> kernel_map_pages() safe *after* booting?! I know we use it for
> debug_pagealloc(), but using it in a production-ready feature feels
> completely irresponsible. What am I missing?

My understanding was that DEBUG_PAGEALLOC maps the direct map at 4k
post-boot as well, so that kernel_map_pages() can't fail for it and to
avoid having to take locks for splits in interrupts.

I think you are on to something though. That function is essentially a
set_memory_() function on x86 at least, and many callers do not check the
error codes. Kees Cook has been pushing to fix this actually:

https://github.com/KSPP/linux/issues/7

As long as a page has been broken to 4k, it should be able to be re-mapped
without failure (like debug page alloc relies on). If an initial call to
restrict permissions needs, and fails, to break a page, its remap does not
need to succeed either before freeing the page, since the page never got
set with a lower permission. That's my understanding from x86's cpa at
least. The problem becomes that the page silently doesn't have its
expected protections during use. Unchecked set_memory_() caching property
change failures might result in functional problems, though.

So there is some background if you wanted it, but yes, I agree this
feature should handle the case where the unmap fails -- probably return an
error on the IOCTL and maybe the hypercall. kernel_map_pages() also
doesn't do a TLB shootdown, so this direct map protection is more in the
best-effort category in its current state.

As for unmapping causing direct map fragmentation, the two techniques
floating around, besides breaking pages indiscriminately, seem to be:
1. Group and cache unmapped pages to minimize the damage (like secretmem).
2. Somehow repair them back to large pages when reset RW (as Kirill had
   tried earlier this year in another series).

I had imagined this usage would want something like that eventually, but
neither helps with the other limitations you mentioned (migration, etc).
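[Editorial illustration] To make the error-handling point above concrete, here is a minimal, hypothetical sketch of what a failure-aware unmap helper could look like if it were built on the set_direct_map_*() primitives used by the secretmem series rather than on kernel_map_pages(). It is not part of Kirill's patch; the name kvm_unmap_page_checked() and the exact TLB-flush placement are assumptions for illustration only.

```c
#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical, error-propagating alternative to kvm_unmap_page().
 * set_direct_map_invalid_noflush() may fail (e.g. with -ENOMEM) when a
 * 2M/1G direct mapping has to be split, so the failure is reported to
 * the caller instead of being silently ignored.
 */
static int kvm_unmap_page_checked(struct page *page, int nr_pages)
{
	unsigned long addr = (unsigned long)page_address(page);
	int i, ret;

	for (i = 0; i < nr_pages; i++) {
		ret = set_direct_map_invalid_noflush(page + i);
		if (ret)
			goto restore;
	}

	/* Shoot down stale direct-map TLB entries for the whole range. */
	flush_tlb_kernel_range(addr, addr + nr_pages * PAGE_SIZE);
	return 0;

restore:
	/* Map the already-unmapped pages back before reporting the error. */
	while (i--)
		set_direct_map_default_noflush(page + i);
	return ret;
}
```

A caller in the conversion path could then fail the ioctl or hypercall, as suggested above, rather than leaving the page reachable through the direct map.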
On Tue, 2020-10-20 at 09:18 +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
>
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.

So do we care about this scenario, where a malicious userspace causes a
kernel oops? I'm not sure if it's prevented somehow.

    CPU0 (exercising other kernel functionality)    CPU1
                                                    mark page shared
    page = get_user_pages(!FOLL_KVM)
                                                    mark page private
    kmap(page)
    access unmapped page and oops
On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> If the protected memory feature enabled, unmap guest memory from
> kernel's direct mappings.
>
> Migration and KSM is disabled for protected memory as it would require a
> special treatment.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h       |  3 +++
>  mm/huge_memory.c         |  8 ++++++++
>  mm/ksm.c                 |  2 ++
>  mm/memory.c              | 12 ++++++++++++
>  mm/rmap.c                |  4 ++++
>  virt/lib/mem_protected.c | 21 +++++++++++++++++++++
>  6 files changed, 50 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ee274d27e764..74efc51e63f0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_KVM_PROTECTED;
>  }
>
> +void kvm_map_page(struct page *page, int nr_pages);
> +void kvm_unmap_page(struct page *page, int nr_pages);

This still does not seem right ;-)

And I still think that map/unmap primitives should be a part of the
generic mm rather than exported by KVM.

> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ec8cf9a40cfd..40974656cb43 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -627,6 +627,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  		spin_unlock(vmf->ptl);
>  		count_vm_event(THP_FAULT_ALLOC);
>  		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
> +
> +		/* Unmap page from direct mapping */
> +		if (vma_is_kvm_protected(vma))
> +			kvm_unmap_page(page, HPAGE_PMD_NR);
>  	}
>
>  	return 0;
> @@ -1689,6 +1693,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			page_remove_rmap(page, true);
>  			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_kvm_protected(vma))
> +				kvm_map_page(page, HPAGE_PMD_NR);
>  		} else if (thp_migration_supported()) {
>  			swp_entry_t entry;
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 9afccc36dbd2..c720e271448f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -528,6 +528,8 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
>  		return NULL;
>  	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
>  		return NULL;
> +	if (vma_is_kvm_protected(vma))
> +		return NULL;
>  	return vma;
>  }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2c9756b4e52f..e28bd5f902a7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1245,6 +1245,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
>  			}
> +
> +			/* Map the page back to the direct mapping */
> +			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
> +				kvm_map_page(page, 1);
> +
>  			rss[mm_counter(page)]--;
>  			page_remove_rmap(page, false);
>  			if (unlikely(page_mapcount(page) < 0))
> @@ -3466,6 +3471,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	struct page *page;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	bool set = false;
>
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -3554,6 +3560,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, vmf->address, false);
>  	lru_cache_add_inactive_or_unevictable(page, vma);
> +	set = true;
>  setpte:
>  	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>
> @@ -3561,6 +3568,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	update_mmu_cache(vma, vmf->address, vmf->pte);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	/* Unmap page from direct mapping */
> +	if (vma_is_kvm_protected(vma) && set)
> +		kvm_unmap_page(page, 1);
> +
>  	return ret;
>  release:
>  	put_page(page);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9425260774a1..247548d6d24b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1725,6 +1725,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
>  static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
>  {
> +	/* TODO */
> +	if (vma_is_kvm_protected(vma))
> +		return true;
> +
>  	return vma_is_temporary_stack(vma);
>  }
>
> diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
> index 1dfe82534242..9d2ef99285e5 100644
> --- a/virt/lib/mem_protected.c
> +++ b/virt/lib/mem_protected.c
> @@ -30,6 +30,27 @@ void kvm_unmap_page_atomic(void *vaddr)
>  }
>  EXPORT_SYMBOL_GPL(kvm_unmap_page_atomic);
>
> +void kvm_map_page(struct page *page, int nr_pages)
> +{
> +	int i;
> +
> +	/* Clear page before returning it to the direct mapping */
> +	for (i = 0; i < nr_pages; i++) {
> +		void *p = kvm_map_page_atomic(page + i);
> +		memset(p, 0, PAGE_SIZE);
> +		kvm_unmap_page_atomic(p);
> +	}
> +
> +	kernel_map_pages(page, nr_pages, 1);
> +}
> +EXPORT_SYMBOL_GPL(kvm_map_page);
> +
> +void kvm_unmap_page(struct page *page, int nr_pages)
> +{
> +	kernel_map_pages(page, nr_pages, 0);
> +}
> +EXPORT_SYMBOL_GPL(kvm_unmap_page);
> +
>  int kvm_init_protected_memory(void)
>  {
>  	guest_map_ptes = kmalloc_array(num_possible_cpus(),
> --
> 2.26.2
On Fri, Oct 23, 2020 at 03:37:12PM +0300, Mike Rapoport wrote:
> On Tue, Oct 20, 2020 at 09:18:58AM +0300, Kirill A. Shutemov wrote:
> > If the protected memory feature enabled, unmap guest memory from
> > kernel's direct mappings.
> >
> > Migration and KSM is disabled for protected memory as it would require a
> > special treatment.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h       |  3 +++
> >  mm/huge_memory.c         |  8 ++++++++
> >  mm/ksm.c                 |  2 ++
> >  mm/memory.c              | 12 ++++++++++++
> >  mm/rmap.c                |  4 ++++
> >  virt/lib/mem_protected.c | 21 +++++++++++++++++++++
> >  6 files changed, 50 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ee274d27e764..74efc51e63f0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
> >  	return vma->vm_flags & VM_KVM_PROTECTED;
> >  }
> >
> > +void kvm_map_page(struct page *page, int nr_pages);
> > +void kvm_unmap_page(struct page *page, int nr_pages);
>
> This still does not seem right ;-)
>
> And I still think that map/unmap primitives should be a part of the
> generic mm rather than exported by KVM.

Ya, and going a step further, I suspect it will be cleaner in the long run
if the kernel does not automatically map or unmap when converting between
private and shared/public memory. Conversions will be rare in a well
behaved guest, so exiting to userspace and forcing userspace to do the
unmap->map would not be a performance bottleneck. In theory, userspace
could also maintain separate pools for private vs. public mappings, though
I doubt any VMM will do that in practice.
On 10/20/20 7:18 AM, David Hildenbrand wrote:
> On 20.10.20 08:18, Kirill A. Shutemov wrote:
>> If the protected memory feature enabled, unmap guest memory from
>> kernel's direct mappings.
>
> Gah, ugly. I guess this also defeats compaction, swapping, ... oh gosh.
> As if all of the encrypted VM implementations didn't bring us enough
> ugliness already (SEV extensions also don't support reboots, but can at
> least kexec() IIRC).

SEV does support reboot. SEV-ES using Qemu doesn't support reboot because
of the way Qemu resets the vCPU state. If Qemu could relaunch the guest
through the SEV APIs to reset the vCPU state, then a "reboot" would be
possible.

SEV does support kexec; SEV-ES does not at the moment.

Thanks,
Tom

> Something similar is done with secretmem [1]. And people don't seem to
> like fragmenting the direct mapping (including me).
>
> [1] https://lkml.kernel.org/r/20200924132904.1391-1-rppt@kernel.org
```diff
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee274d27e764..74efc51e63f0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -671,6 +671,9 @@ static inline bool vma_is_kvm_protected(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_KVM_PROTECTED;
 }
 
+void kvm_map_page(struct page *page, int nr_pages);
+void kvm_unmap_page(struct page *page, int nr_pages);
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec8cf9a40cfd..40974656cb43 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -627,6 +627,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+
+		/* Unmap page from direct mapping */
+		if (vma_is_kvm_protected(vma))
+			kvm_unmap_page(page, HPAGE_PMD_NR);
 	}
 
 	return 0;
@@ -1689,6 +1693,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_kvm_protected(vma))
+				kvm_map_page(page, HPAGE_PMD_NR);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 9afccc36dbd2..c720e271448f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -528,6 +528,8 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 		return NULL;
 	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 		return NULL;
+	if (vma_is_kvm_protected(vma))
+		return NULL;
 	return vma;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 2c9756b4e52f..e28bd5f902a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1245,6 +1245,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
+
+			/* Map the page back to the direct mapping */
+			if (vma_is_anonymous(vma) && vma_is_kvm_protected(vma))
+				kvm_map_page(page, 1);
+
 			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
@@ -3466,6 +3471,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	bool set = false;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3554,6 +3560,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	lru_cache_add_inactive_or_unevictable(page, vma);
+	set = true;
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3561,6 +3568,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/* Unmap page from direct mapping */
+	if (vma_is_kvm_protected(vma) && set)
+		kvm_unmap_page(page, 1);
+
 	return ret;
 release:
 	put_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 9425260774a1..247548d6d24b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1725,6 +1725,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
 {
+	/* TODO */
+	if (vma_is_kvm_protected(vma))
+		return true;
+
 	return vma_is_temporary_stack(vma);
 }
 
diff --git a/virt/lib/mem_protected.c b/virt/lib/mem_protected.c
index 1dfe82534242..9d2ef99285e5 100644
--- a/virt/lib/mem_protected.c
+++ b/virt/lib/mem_protected.c
@@ -30,6 +30,27 @@ void kvm_unmap_page_atomic(void *vaddr)
 }
 EXPORT_SYMBOL_GPL(kvm_unmap_page_atomic);
 
+void kvm_map_page(struct page *page, int nr_pages)
+{
+	int i;
+
+	/* Clear page before returning it to the direct mapping */
+	for (i = 0; i < nr_pages; i++) {
+		void *p = kvm_map_page_atomic(page + i);
+		memset(p, 0, PAGE_SIZE);
+		kvm_unmap_page_atomic(p);
+	}
+
+	kernel_map_pages(page, nr_pages, 1);
+}
+EXPORT_SYMBOL_GPL(kvm_map_page);
+
+void kvm_unmap_page(struct page *page, int nr_pages)
+{
+	kernel_map_pages(page, nr_pages, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_unmap_page);
+
 int kvm_init_protected_memory(void)
 {
 	guest_map_ptes = kmalloc_array(num_possible_cpus(),
```
If the protected memory feature is enabled, unmap guest memory from the
kernel's direct mappings.

Migration and KSM are disabled for protected memory as they would require
special treatment.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h       |  3 +++
 mm/huge_memory.c         |  8 ++++++++
 mm/ksm.c                 |  2 ++
 mm/memory.c              | 12 ++++++++++++
 mm/rmap.c                |  4 ++++
 virt/lib/mem_protected.c | 21 +++++++++++++++++++++
 6 files changed, 50 insertions(+)
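[Editorial illustration] For reference, a minimal sketch of how the two helpers added by this patch pair up over the lifetime of a protected page, mirroring what the fault-in (do_anonymous_page) and zap (zap_pte_range) hunks above do. The standalone protected_page_demo() function and its direct page allocation are assumptions made for the example, not code from the series.

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Demo of the lifetime this patch enforces for anonymous pages in a
 * VM_KVM_PROTECTED VMA: the page drops out of the direct mapping right
 * after it is faulted in, and is cleared and mapped back just before it
 * is freed when the mapping is zapped.
 */
static void protected_page_demo(void)
{
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return;

	/* Fault path: hide the page from the kernel's direct mapping. */
	kvm_unmap_page(page, 1);

	/* ... the guest uses the page; the kernel must not touch it ... */

	/*
	 * Zap path: kvm_map_page() scrubs the contents and restores the
	 * direct mapping before the page goes back to the allocator.
	 */
	kvm_map_page(page, 1);

	__free_page(page);
}
```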