diff mbox series

[RFC] KVM: remove the writable page for read fault case in hva_to_pfn_slow()

Message ID 20181008144138.41677-1-richard.weiyang@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC] KVM: remove the writable page for read fault case in hva_to_pfn_slow() | expand

Commit Message

Wei Yang Oct. 8, 2018, 2:41 p.m. UTC
Case (!write_fault && writable) has been handled in hva_to_pfn_fast(),
it is not necessary to try again if hva_to_pfn_fast() already failed.

This patch removes this case in hva_to_pfn_slow().

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>

---

Hope my understanding is correct.

---
 virt/kvm/kvm_main.c | 10 ----------
 1 file changed, 10 deletions(-)

Comments

Sean Christopherson Oct. 8, 2018, 3:21 p.m. UTC | #1
On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote:
> Case (!write_fault && writable) has been handled in hva_to_pfn_fast(),
> it is not necessary to try again if hva_to_pfn_fast() already failed.
> 
> This patch removes this case in hva_to_pfn_slow().
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> 
> ---
> 
> Hope my understanding is correct.
> 
> ---
>  virt/kvm/kvm_main.c | 10 ----------
>  1 file changed, 10 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1f42f1d474b5..c8fb3a9d81fa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  	if (npages != 1)
>  		return npages;
>  
> -	/* map read fault as writable if possible */
> -	if (unlikely(!write_fault) && writable) {
> -		struct page *wpage;
> -
> -		if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) {
> -			*writable = true;
> -			put_page(page);
> -			page = wpage;
> -		}
> -	}

This handles the scenario where __get_user_pages_fast() failed and
get_user_pages_unlocked() succeeded, obviously with !write_fault &&
writable.  On a read fault, get_user_pages_unlocked() requests the
page with read permissions to ensure it doesn't incorrectly fail due
to invalid access permissions, i.e. doesn't speculatively request
write permissions for a read-only VMA.  If that succeeds, we then
use __get_user_pages_fast() to test if the VMA is writable.  If the
VMA is indeed writable, we can create the shadow PTE with write
permissions even though we're handling a read fault.  Marking the
SPTE as writable means we don't take another page fault if/when the
guest writes the page, e.g. in a RMW scenario we take one page fault
instead of two.

>  	*pfn = page_to_pfn(page);
>  	return npages;
>  }
Wei Yang Oct. 8, 2018, 4:11 p.m. UTC | #2
On Mon, Oct 08, 2018 at 08:21:53AM -0700, Sean Christopherson wrote:
>On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote:
>> Case (!write_fault && writable) has been handled in hva_to_pfn_fast(),
>> it is not necessary to try again if hva_to_pfn_fast() already failed.
>> 
>> This patch removes this case in hva_to_pfn_slow().
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> 
>> ---
>> 
>> Hope my understanding is correct.
>> 
>> ---
>> ??virt/kvm/kvm_main.c | 10 ----------
>> ??1 file changed, 10 deletions(-)
>> 
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 1f42f1d474b5..c8fb3a9d81fa 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>> ??	if (npages != 1)
>> ??		return npages;
>> ??
>> -	/* map read fault as writable if possible */
>> -	if (unlikely(!write_fault) && writable) {
>> -		struct page *wpage;
>> -
>> -		if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) {
>> -			*writable = true;
>> -			put_page(page);
>> -			page = wpage;
>> -		}
>> -	}
>

Hi, Sean

Thanks for your comments, let me try to understand it.

>This handles the scenario where __get_user_pages_fast() failed and
>get_user_pages_unlocked() succeeded, obviously with !write_fault &&
>writable. ??On a read fault, get_user_pages_unlocked() requests the
>page with read permissions to ensure it doesn't incorrectly fail due
>to invalid access permissions, i.e. doesn't speculatively request
>write permissions for a read-only VMA. ??If that succeeds, we then

I think I can understand at this point.

Based on my understanding, we set *write* to 1 for __get_user_pages_fast()
which means we want a writable page. While in hva_to_pfn_slow(),
FOLL_WRITE is not set for get_user_pages_unlocked(), this means a
readable page would meet our need.

In case a read fault, if __get_user_pages_fast() failed, we still could
try get_user_pages_unlocked() to get a readable page.

>use __get_user_pages_fast() to test if the VMA is writable. ??If the
>VMA is indeed writable, we can create the shadow PTE with write
>permissions even though we're handling a read fault. ??Marking the

I can't follow up with you from this point.

__get_user_pages_fast() iterate the page table and check those
requirements. We pass the same parameter both in hva_to_pfn_slow() and
hva_to_pfn_fast(). Curious about why we would get different result if no
one else change the PTE?

BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you
mind pointing which part does this? I am lost at this point.

>SPTE as writable means we don't take another page fault if/when the
>guest writes the page, e.g. in a RMW scenario we take one page fault
>instead of two.
>

Ah, thanks. I guess this is the benefits for this behavior. This reduces
one page fault by giving a writable page for a read fault.

>> ??	*pfn = page_to_pfn(page);
>> ??	return npages;
>> ??}
Sean Christopherson Oct. 9, 2018, 1:57 p.m. UTC | #3
On Mon, 2018-10-08 at 16:11 +0000, kvm-owner@vger.kernel.org wrote:
> On Mon, Oct 08, 2018 at 08:21:53AM -0700, Sean Christopherson wrote:
> > 
> > On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote:
> > > 
> > > Case (!write_fault && writable) has been handled in hva_to_pfn_fast(),
> > > it is not necessary to try again if hva_to_pfn_fast() already failed.
> > > 
> > > This patch removes this case in hva_to_pfn_slow().
> > > 
> > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> > > 
> > > ---
> > > 
> > > Hope my understanding is correct.
> > > 
> > > ---
> > > ??virt/kvm/kvm_main.c | 10 ----------
> > > ??1 file changed, 10 deletions(-)
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 1f42f1d474b5..c8fb3a9d81fa 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > ??	if (npages != 1)
> > > ??		return npages;
> > > ??
> > > -	/* map read fault as writable if possible */
> > > -	if (unlikely(!write_fault) && writable) {
> > > -		struct page *wpage;
> > > -
> > > -		if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) {
> > > -			*writable = true;
> > > -			put_page(page);
> > > -			page = wpage;
> > > -		}
> > > -	}
> Hi, Sean
> 
> Thanks for your comments, let me try to understand it.
> 
> > 
> > This handles the scenario where __get_user_pages_fast() failed and
> > get_user_pages_unlocked() succeeded, obviously with !write_fault &&
> > writable. ??On a read fault, get_user_pages_unlocked() requests the
> > page with read permissions to ensure it doesn't incorrectly fail due
> > to invalid access permissions, i.e. doesn't speculatively request
> > write permissions for a read-only VMA. ??If that succeeds, we then
> I think I can understand at this point.
> 
> Based on my understanding, we set *write* to 1 for __get_user_pages_fast()
> which means we want a writable page. While in hva_to_pfn_slow(),
> FOLL_WRITE is not set for get_user_pages_unlocked(), this means a
> readable page would meet our need.
> 
> In case a read fault, if __get_user_pages_fast() failed, we still could
> try get_user_pages_unlocked() to get a readable page.
> 
> > 
> > use __get_user_pages_fast() to test if the VMA is writable. ??If the
> > VMA is indeed writable, we can create the shadow PTE with write
> > permissions even though we're handling a read fault. ??Marking the
> I can't follow up with you from this point.
> 
> __get_user_pages_fast() iterate the page table and check those
> requirements. We pass the same parameter both in hva_to_pfn_slow() and
> hva_to_pfn_fast(). Curious about why we would get different result if no
> one else change the PTE?

hva_to_pfn_slow() uses the exact access type (read vs write) when calling
get_user_pages_unlocked(), i.e. FOLL_WRITE is only set if @write_fault is
true, whereas the calls to __get_user_pages_fast() always try to pin the
page as writable.

> BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you
> mind pointing which part does this? I am lost at this point.

That was poorly worded.  Let me try again:

If get_user_pages_unlocked() succeeds and we're handling a read-fault,
use __get_user_pages_fast() to check if the page was mapped writable.
Even though we only request read permissions, the PTE *may* have write
permissions if the corresponding VMA is writable, e.g. a writable VMA
will generate a read-only PTE when the page is backed by the zero page
using copy-on-write semantics.

If the PTE is indeed writable, we can also create the shadow PTE with
write permissions even though we're handling a read fault.  Marking
the SPTE as writable means we don't take another page fault if/when
the guest writes the page, e.g. in a RMW scenario we take one page
fault instead of two.

> > 
> > SPTE as writable means we don't take another page fault if/when the
> > guest writes the page, e.g. in a RMW scenario we take one page fault
> > instead of two.
> > 
> Ah, thanks. I guess this is the benefits for this behavior. This reduces
> one page fault by giving a writable page for a read fault.
> 
> > 
> > > 
> > > ??	*pfn = page_to_pfn(page);
> > > ??	return npages;
> > > ??}
Wei Yang Oct. 9, 2018, 3:32 p.m. UTC | #4
On Tue, Oct 09, 2018 at 06:57:33AM -0700, Sean Christopherson wrote:
>On Mon, 2018-10-08 at 16:11 +0000, kvm-owner@vger.kernel.org wrote:
>> Hi, Sean
>> 
>> Thanks for your comments, let me try to understand it.
>> 
>> > 
>> > This handles the scenario where __get_user_pages_fast() failed and
>> > get_user_pages_unlocked() succeeded, obviously with !write_fault &&
>> > writable. ??On a read fault, get_user_pages_unlocked() requests the
>> > page with read permissions to ensure it doesn't incorrectly fail due
>> > to invalid access permissions, i.e. doesn't speculatively request
>> > write permissions for a read-only VMA. ??If that succeeds, we then
>> I think I can understand at this point.
>> 
>> Based on my understanding, we set *write* to 1 for __get_user_pages_fast()
>> which means we want a writable page. While in hva_to_pfn_slow(),
>> FOLL_WRITE is not set for get_user_pages_unlocked(), this means a
>> readable page would meet our need.
>> 
>> In case a read fault, if __get_user_pages_fast() failed, we still could
>> try get_user_pages_unlocked() to get a readable page.
>> 
>> > 
>> > use __get_user_pages_fast() to test if the VMA is writable. ??If the
>> > VMA is indeed writable, we can create the shadow PTE with write
>> > permissions even though we're handling a read fault. ??Marking the
>> I can't follow up with you from this point.
>> 
>> __get_user_pages_fast() iterate the page table and check those
>> requirements. We pass the same parameter both in hva_to_pfn_slow() and
>> hva_to_pfn_fast(). Curious about why we would get different result if no
>> one else change the PTE?
>
>hva_to_pfn_slow() uses the exact access type (read vs write) when calling
>get_user_pages_unlocked(), i.e. FOLL_WRITE is only set if @write_fault is
>true, whereas the calls to __get_user_pages_fast() always try to pin the
>page as writable.
>
>> BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you
>> mind pointing which part does this? I am lost at this point.

Sean,

Thanks a lot for your patient explanation.

>
>That was poorly worded. ??Let me try again:
>
>If get_user_pages_unlocked() succeeds and we're handling a read-fault,
>use __get_user_pages_fast() to check if the page was mapped writable.
>Even though we only request read permissions, the PTE *may* have write
>permissions if the corresponding VMA is writable, e.g. a writable VMA
>will generate a read-only PTE when the page is backed by the zero page
>using copy-on-write semantics.

To be honest, I got some difficulty to understand this part. The reason
is some background knowledge I missed.

Below is my guess on the background knowledge:
----------------------------------------------

There are two level/kind RW permission marking mechanism

   * VMA mapping: encoded in vma->vm_page_prot
   * PTE mapping: encoded in pteval, e.g. _PAGE_RW

And these two permission for the same addr/page could be different.

Hmm, if I am right (not understand the detail in the code, and not
convienced by myself),

   * get_user_pages_unlocked() checks VMA mapping
   * __get_user_pages_fast() checks PTE mapping

Based on your last sentence, VMA mapping has a higher priority. A
writable VMA could be created through mmap, while its corresponding PTE
could be read-only since a copy-on-write semantics applies here.

This is the reason why we could retrive a read-only page from PTE
mapping point of view, but writable from VMA point of view. And then
based on our requirement, give the page write permission.

>
>If the PTE is indeed writable, we can also create the shadow PTE with
>write permissions even though we're handling a read fault.????Marking
>the SPTE as writable means we don't take another page fault if/when
>the guest writes the page, e.g. in a RMW scenario we take one page
>fault instead of two.
>

This makes sence to me.

As you mentioned in the last sentense of previous paragraph, a read-only
PTE with writable VMA means the PTE is indeed writalbe. In this case we
could set the SPTE as write to reduce another page fault in a RMW
scenario.

After all I still have one confusion:

  My confusion is why the 2nd __get_user_pages_fast() would succeed
  while the 1st failed with same parameters passed. Do we touch the PTE
  mapping during get_user_pages_unlocked()?
  
  Hope this is not too silly.

Thanks for your patience a lot :-)

>> > 
>> > SPTE as writable means we don't take another page fault if/when the
>> > guest writes the page, e.g. in a RMW scenario we take one page fault
>> > instead of two.
>> > 
>> Ah, thanks. I guess this is the benefits for this behavior. This reduces
>> one page fault by giving a writable page for a read fault.
>> 
>> > 
>> > > 
>> > > ??	*pfn = page_to_pfn(page);
>> > > ??	return npages;
>> > > ??}
Sean Christopherson Oct. 16, 2018, 3:08 p.m. UTC | #5
On Tue, 2018-10-09 at 15:32 +0000, richard.weiyang@gmail.com wrote:
> > If the PTE is indeed writable, we can also create the shadow PTE with
> > write permissions even though we're handling a read fault.????Marking
> > the SPTE as writable means we don't take another page fault if/when
> > the guest writes the page, e.g. in a RMW scenario we take one page
> > fault instead of two.
> > 
> This makes sence to me.
> 
> As you mentioned in the last sentense of previous paragraph, a read-only
> PTE with writable VMA means the PTE is indeed writalbe. In this case we
> could set the SPTE as write to reduce another page fault in a RMW
> scenario.
> 
> After all I still have one confusion:
> 
>   My confusion is why the 2nd __get_user_pages_fast() would succeed
>   while the 1st failed with same parameters passed. Do we touch the PTE
>   mapping during get_user_pages_unlocked()?

Yep, get_user_pages_unlocked() will fault-in the page if necessary.
Wei Yang Oct. 16, 2018, 10:38 p.m. UTC | #6
On Tue, Oct 16, 2018 at 08:08:44AM -0700, Sean Christopherson wrote:
>On Tue, 2018-10-09 at 15:32 +0000, richard.weiyang@gmail.com wrote:
>> > If the PTE is indeed writable, we can also create the shadow PTE with
>> > write permissions even though we're handling a read fault.????Marking
>> > the SPTE as writable means we don't take another page fault if/when
>> > the guest writes the page, e.g. in a RMW scenario we take one page
>> > fault instead of two.
>> > 
>> This makes sence to me.
>> 
>> As you mentioned in the last sentense of previous paragraph, a read-only
>> PTE with writable VMA means the PTE is indeed writalbe. In this case we
>> could set the SPTE as write to reduce another page fault in a RMW
>> scenario.
>> 
>> After all I still have one confusion:
>> 
>> ?? My confusion is why the 2nd __get_user_pages_fast() would succeed
>> ?? while the 1st failed with same parameters passed. Do we touch the PTE
>> ?? mapping during get_user_pages_unlocked()?
>
>Yep, get_user_pages_unlocked() will fault-in the page if necessary.

Ah, got it.

Thanks :-)
diff mbox series

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f42f1d474b5..c8fb3a9d81fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1403,16 +1403,6 @@  static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (npages != 1)
 		return npages;
 
-	/* map read fault as writable if possible */
-	if (unlikely(!write_fault) && writable) {
-		struct page *wpage;
-
-		if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) {
-			*writable = true;
-			put_page(page);
-			page = wpage;
-		}
-	}
 	*pfn = page_to_pfn(page);
 	return npages;
 }