
x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs

Message ID 1491397622-16665-1-git-send-email-sironi@amazon.de (mailing list archive)
State New, archived

Commit Message

Sironi, Filippo April 5, 2017, 1:07 p.m. UTC
cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
pages and the respective struct pages for mapping in the kernel virtual
address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g.,
when limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).

If get_user_pages_fast() fails, look up the VMA that backs the userspace
virtual address, compute the PFN and the physical address, and map it in
the kernel virtual address space with memremap().

Signed-off-by: Filippo Sironi <sironi@amazon.de>
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kvm/paging_tmpl.h | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

Comments

Radim Krčmář April 6, 2017, 2:22 p.m. UTC | #1
2017-04-05 15:07+0200, Filippo Sironi:
> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
> pages and the respective struct pages for mapping in the kernel virtual
> address space.
> This doesn't work if get_user_pages_fast() is invoked with a userspace
> virtual address that's backed by PFNs outside of kernel reach (e.g.,
> when limiting the kernel memory with mem= in the command line and using
> /dev/mem to map memory).
> 
> If get_user_pages_fast() fails, look up the VMA that backs the userspace
> virtual address, compute the PFN and the physical address, and map it in
> the kernel virtual address space with memremap().

What is the reason for a configuration that voluntarily restricts access
to memory that it needs?

> Signed-off-by: Filippo Sironi <sironi@amazon.de>
> Cc: Anthony Liguori <aliguori@amazon.com>
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> @@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	struct page *page;
>  
>  	npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
> -	/* Check if the user is doing something meaningless. */
> -	if (unlikely(npages != 1))
> -		return -EFAULT;
> -
> -	table = kmap_atomic(page);
> -	ret = CMPXCHG(&table[index], orig_pte, new_pte);
> -	kunmap_atomic(table);
> -
> -	kvm_release_page_dirty(page);
> +	if (likely(npages == 1)) {
> +		table = kmap_atomic(page);
> +		ret = CMPXCHG(&table[index], orig_pte, new_pte);
> +		kunmap_atomic(table);
> +
> +		kvm_release_page_dirty(page);
> +	} else {
> +		struct vm_area_struct *vma;
> +		unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
> +		unsigned long pfn;
> +		unsigned long paddr;
> +
> +		down_read(&current->mm->mmap_sem);
> +		vma = find_vma_intersection(current->mm, vaddr,
> +					    vaddr + PAGE_SIZE);

Hm, with the argument order like this, we check that

  vaddr < vma->vm_end && vaddr + PAGE_SIZE > vma->vm_start

but shouldn't we actually check that the whole page is there, i.e.

  vaddr + PAGE_SIZE < vma->vm_end && vaddr > vma->vm_start

?

Thanks.

> +		if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
> +			up_read(&current->mm->mmap_sem);
> +			return -EFAULT;
> +		}
> +		pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +		paddr = pfn << PAGE_SHIFT;
> +		table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);

(I don't understand why there isn't a wrapper for this ...
 Looks like we're doing something unexpected.)

> +		if (!table) {
> +			up_read(&current->mm->mmap_sem);
> +			return -EFAULT;
> +		}
> +		ret = CMPXCHG(&table[index], orig_pte, new_pte);
> +		memunmap(table);
> +		up_read(&current->mm->mmap_sem);
> +	}
>  
>  	return (ret != orig_pte);
>  }
> -- 
> 2.7.4
>
Sironi, Filippo April 12, 2017, 1:16 p.m. UTC | #2
Thanks for taking the time and sorry for the delay.

> On 6. Apr 2017, at 16:22, Radim Krčmář <rkrcmar@redhat.com> wrote:
>
> 2017-04-05 15:07+0200, Filippo Sironi:
>> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
>> pages and the respective struct pages for mapping in the kernel virtual
>> address space.
>> This doesn't work if get_user_pages_fast() is invoked with a userspace
>> virtual address that's backed by PFNs outside of kernel reach (e.g.,
>> when limiting the kernel memory with mem= in the command line and using
>> /dev/mem to map memory).
>>
>> If get_user_pages_fast() fails, look up the VMA that backs the userspace
>> virtual address, compute the PFN and the physical address, and map it in
>> the kernel virtual address space with memremap().
>
> What is the reason for a configuration that voluntarily restricts access
> to memory that it needs?

By using /dev/mem to provide VM memory, one can avoid the overhead of
allocating struct page(s) for the whole memory, which is wasteful when
using a server entirely for hosting VMs.
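
For illustration, a minimal sketch of the kind of setup this targets
(addresses and sizes are made up; it assumes the host booted with mem=
so the range sits outside kernel-managed memory, and that /dev/mem
access to that range is permitted). The mmap() below produces a
VM_PFNMAP mapping with no struct page behind it, which is exactly why
get_user_pages_fast() fails on such an address:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical example: host booted with mem=4G, so physical
	 * memory above 4 GiB has no struct pages. */
	off_t guest_phys_base = 0x100000000ULL;	/* 4 GiB, illustration only */
	size_t guest_size = 1UL << 30;		/* 1 GiB of guest memory */

	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	/* The /dev/mem driver maps this with remap_pfn_range(), so the
	 * resulting VMA is VM_PFNMAP: there is nothing for
	 * get_user_pages_fast() to pin. */
	void *mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, guest_phys_base);
	if (mem == MAP_FAILED) {
		perror("mmap /dev/mem");
		return 1;
	}

	/* This region would then be handed to KVM as a memslot
	 * (KVM_SET_USER_MEMORY_REGION) and used as guest RAM. */
	printf("guest memory mapped at %p\n", mem);
	munmap(mem, guest_size);
	close(fd);
	return 0;
}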

>> Signed-off-by: Filippo Sironi <sironi@amazon.de>
>> Cc: Anthony Liguori <aliguori@amazon.com>
>> Cc: kvm@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> @@ -147,15 +147,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>> 	struct page *page;
>>
>> 	npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
>> -	/* Check if the user is doing something meaningless. */
>> -	if (unlikely(npages != 1))
>> -		return -EFAULT;
>> -
>> -	table = kmap_atomic(page);
>> -	ret = CMPXCHG(&table[index], orig_pte, new_pte);
>> -	kunmap_atomic(table);
>> -
>> -	kvm_release_page_dirty(page);
>> +	if (likely(npages == 1)) {
>> +		table = kmap_atomic(page);
>> +		ret = CMPXCHG(&table[index], orig_pte, new_pte);
>> +		kunmap_atomic(table);
>> +
>> +		kvm_release_page_dirty(page);
>> +	} else {
>> +		struct vm_area_struct *vma;
>> +		unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
>> +		unsigned long pfn;
>> +		unsigned long paddr;
>> +
>> +		down_read(&current->mm->mmap_sem);
>> +		vma = find_vma_intersection(current->mm, vaddr,
>> +					    vaddr + PAGE_SIZE);
>
> Hm, with the argument order like this, we check that
>
>  vaddr < vma->vm_end && vaddr + PAGE_SIZE > vma->vm_start
>
> but shouldn't we actually check that the whole page is there, i.e.
>
>  vaddr + PAGE_SIZE < vma->vm_end && vaddr > vma->vm_start
>
> ?
>
> Thanks.

Hm, right now we check for the following:

    vaddr >= vma->vm_start && vaddr < vma->vm_end && vaddr + PAGE_SIZE > vma->vm_start

Given that vaddr is PAGE_SIZE aligned (and vma->vm_end is page aligned too), we're guaranteed that vaddr + PAGE_SIZE <= vma->vm_end.
This seems more complex than necessary. I believe that:

    vma = find_vma(current->mm, vaddr);

should be enough.
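
Spelled out, a sketch of what the lookup could look like (illustrative
only; lookup_pfnmap_vma() is not an existing helper). find_vma() only
guarantees vaddr < vma->vm_end, so the vm_start comparison below makes
the lower bound explicit; with vaddr and the VMA boundaries page
aligned, the whole page [vaddr, vaddr + PAGE_SIZE) is then covered:

/*
 * Illustrative only: an explicit version of the VMA lookup discussed
 * above.  The vm_start check is what guarantees that vaddr actually
 * lies inside the VMA returned by find_vma().
 */
static struct vm_area_struct *lookup_pfnmap_vma(struct mm_struct *mm,
						 unsigned long vaddr)
{
	struct vm_area_struct *vma = find_vma(mm, vaddr);

	if (!vma || vaddr < vma->vm_start || !(vma->vm_flags & VM_PFNMAP))
		return NULL;

	return vma;
}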

>> +		if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
>> +			up_read(&current->mm->mmap_sem);
>> +			return -EFAULT;
>> +		}
>> +		pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> +		paddr = pfn << PAGE_SHIFT;
>> +		table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
>
> (I don't understand why there isn't a wrapper for this ...
> Looks like we're doing something unexpected.)

Do you mean a wrapper for getting the pfn/paddr?

>> +		if (!table) {
>> +			up_read(&current->mm->mmap_sem);
>> +			return -EFAULT;
>> +		}
>> +		ret = CMPXCHG(&table[index], orig_pte, new_pte);
>> +		memunmap(table);
>> +		up_read(&current->mm->mmap_sem);
>> +	}
>>
>> 	return (ret != orig_pte);
>> }
>> --
>> 2.7.4

I'll submit a v2 version soon.

Xiao Guangrong April 17, 2017, 11:26 a.m. UTC | #3
On 04/12/2017 09:16 PM, Sironi, Filippo wrote:
> Thanks for taking the time and sorry for the delay.
>
>> On 6. Apr 2017, at 16:22, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>
>> 2017-04-05 15:07+0200, Filippo Sironi:
>>> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
>>> pages and the respective struct pages for mapping in the kernel virtual
>>> address space.
>>> This doesn't work if get_user_pages_fast() is invoked with a userspace
>>> virtual address that's backed by PFNs outside of kernel reach (e.g.,
>>> when limiting the kernel memory with mem= in the command line and using
>>> /dev/mem to map memory).
>>>
>>> If get_user_pages_fast() fails, look up the VMA that backs the userspace
>>> virtual address, compute the PFN and the physical address, and map it in
>>> the kernel virtual address space with memremap().
>>
>> What is the reason for a configuration that voluntarily restricts access
>> to memory that it needs?
>
> By using /dev/mem to provide VM memory, one can avoid the overhead of allocating struct page(s) for the whole memory, which is wasteful when using a server entirely for hosting VMs.
>

Sounds reasonable; however, it is incomplete so far, as some code paths
still do not support memory that isn't backed by struct pages, e.g.,
emulator_cmpxchg_emulated().

I would suggest unifying the code introduced in this patch with the
existing hva_to_pfn(); we could also introduce a common API, maybe named
kvm_map_hva(), to improve the caller sides.
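
Roughly, such a helper might look like the sketch below (hypothetical
code, not an existing API: kvm_map_hva(), kvm_unmap_hva_mapping() and
struct hva_mapping are made-up names, and locking/error handling are
simplified). It keeps the struct-page path and falls back to the
VM_PFNMAP path from this patch:

/*
 * Hypothetical sketch only: one possible shape for a common helper
 * that maps the page backing a userspace address into the kernel.
 */
struct hva_mapping {
	struct page *page;	/* non-NULL when backed by a struct page */
	void *hva;		/* kernel virtual address of the mapping */
};

static int kvm_map_hva(unsigned long uaddr, struct hva_mapping *map)
{
	unsigned long vaddr = uaddr & PAGE_MASK;
	struct vm_area_struct *vma;
	unsigned long pfn;

	/* Fast path: the address is backed by a struct page. */
	if (get_user_pages_fast(vaddr, 1, 1, &map->page) == 1) {
		map->hva = kmap(map->page);
		return 0;
	}

	/* Fallback: VM_PFNMAP memory with no struct page behind it. */
	map->page = NULL;
	down_read(&current->mm->mmap_sem);
	vma = find_vma(current->mm, vaddr);
	if (!vma || vaddr < vma->vm_start || !(vma->vm_flags & VM_PFNMAP)) {
		up_read(&current->mm->mmap_sem);
		return -EFAULT;
	}
	pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
	up_read(&current->mm->mmap_sem);

	map->hva = memremap(pfn << PAGE_SHIFT, PAGE_SIZE, MEMREMAP_WB);
	return map->hva ? 0 : -EFAULT;
}

/* Hypothetical counterpart that releases whatever kvm_map_hva() set up. */
static void kvm_unmap_hva_mapping(struct hva_mapping *map, bool dirty)
{
	if (map->page) {
		kunmap(map->page);
		if (dirty)
			kvm_release_page_dirty(map->page);
		else
			kvm_release_page_clean(map->page);
	} else {
		memunmap(map->hva);
	}
}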


BTW, I do not know why we use kmap_atomic() rather than kmap(); the
path through cmpxchg_gpte() can sleep anyway.
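
For illustration, the struct-page branch could then be written as below
(sketch only; cmpxchg_gpte_page() is a made-up helper name, with
pt_element_t and CMPXCHG as defined by the paging template):

/*
 * Sketch only: the struct-page branch of cmpxchg_gpte() using kmap()
 * instead of kmap_atomic(), which is fine because this path may sleep.
 */
static bool cmpxchg_gpte_page(struct page *page, int index,
			      pt_element_t orig_pte, pt_element_t new_pte)
{
	pt_element_t *table = kmap(page);	/* may sleep, unlike kmap_atomic() */
	pt_element_t ret = CMPXCHG(&table[index], orig_pte, new_pte);

	kunmap(page);		/* kunmap() takes the page, not the address */
	kvm_release_page_dirty(page);
	return ret != orig_pte;
}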

Thanks!
Paolo Bonzini April 18, 2017, 8:12 a.m. UTC | #4
On 17/04/2017 13:26, Xiao Guangrong wrote:
>>
> 
> Sounds reasonable; however, it is incomplete so far, as some code paths
> still do not support memory that isn't backed by struct pages, e.g.,
> emulator_cmpxchg_emulated().
>
> I would suggest unifying the code introduced in this patch with the
> existing hva_to_pfn(); we could also introduce a common API, maybe named
> kvm_map_hva(), to improve the caller sides.

I agree with Guangrong's suggestion.

Paolo

> BTW, I do not know why we use kmap_atomic() rather than kmap(); the
> path through cmpxchg_gpte() can sleep anyway.

Patch

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a01105485315..ab4d6617238c 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -147,15 +147,36 @@  static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	struct page *page;
 
 	npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
-	/* Check if the user is doing something meaningless. */
-	if (unlikely(npages != 1))
-		return -EFAULT;
-
-	table = kmap_atomic(page);
-	ret = CMPXCHG(&table[index], orig_pte, new_pte);
-	kunmap_atomic(table);
-
-	kvm_release_page_dirty(page);
+	if (likely(npages == 1)) {
+		table = kmap_atomic(page);
+		ret = CMPXCHG(&table[index], orig_pte, new_pte);
+		kunmap_atomic(table);
+
+		kvm_release_page_dirty(page);
+	} else {
+		struct vm_area_struct *vma;
+		unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
+		unsigned long pfn;
+		unsigned long paddr;
+
+		down_read(&current->mm->mmap_sem);
+		vma = find_vma_intersection(current->mm, vaddr,
+					    vaddr + PAGE_SIZE);
+		if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
+			up_read(&current->mm->mmap_sem);
+			return -EFAULT;
+		}
+		pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		paddr = pfn << PAGE_SHIFT;
+		table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
+		if (!table) {
+			up_read(&current->mm->mmap_sem);
+			return -EFAULT;
+		}
+		ret = CMPXCHG(&table[index], orig_pte, new_pte);
+		memunmap(table);
+		up_read(&current->mm->mmap_sem);
+	}
 
 	return (ret != orig_pte);
 }