Message ID | 20181008144138.41677-1-richard.weiyang@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] KVM: remove the writable page for read fault case in hva_to_pfn_slow() | expand |
On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote: > Case (!write_fault && writable) has been handled in hva_to_pfn_fast(), > it is not necessary to try again if hva_to_pfn_fast() already failed. > > This patch removes this case in hva_to_pfn_slow(). > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > > --- > > Hope my understanding is correct. > > --- > virt/kvm/kvm_main.c | 10 ---------- > 1 file changed, 10 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 1f42f1d474b5..c8fb3a9d81fa 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > if (npages != 1) > return npages; > > - /* map read fault as writable if possible */ > - if (unlikely(!write_fault) && writable) { > - struct page *wpage; > - > - if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) { > - *writable = true; > - put_page(page); > - page = wpage; > - } > - } This handles the scenario where __get_user_pages_fast() failed and get_user_pages_unlocked() succeeded, obviously with !write_fault && writable. On a read fault, get_user_pages_unlocked() requests the page with read permissions to ensure it doesn't incorrectly fail due to invalid access permissions, i.e. doesn't speculatively request write permissions for a read-only VMA. If that succeeds, we then use __get_user_pages_fast() to test if the VMA is writable. If the VMA is indeed writable, we can create the shadow PTE with write permissions even though we're handling a read fault. Marking the SPTE as writable means we don't take another page fault if/when the guest writes the page, e.g. in a RMW scenario we take one page fault instead of two. > *pfn = page_to_pfn(page); > return npages; > }
On Mon, Oct 08, 2018 at 08:21:53AM -0700, Sean Christopherson wrote: >On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote: >> Case (!write_fault && writable) has been handled in hva_to_pfn_fast(), >> it is not necessary to try again if hva_to_pfn_fast() already failed. >> >> This patch removes this case in hva_to_pfn_slow(). >> >> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >> >> --- >> >> Hope my understanding is correct. >> >> --- >> ??virt/kvm/kvm_main.c | 10 ---------- >> ??1 file changed, 10 deletions(-) >> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index 1f42f1d474b5..c8fb3a9d81fa 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, >> ?? if (npages != 1) >> ?? return npages; >> ?? >> - /* map read fault as writable if possible */ >> - if (unlikely(!write_fault) && writable) { >> - struct page *wpage; >> - >> - if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) { >> - *writable = true; >> - put_page(page); >> - page = wpage; >> - } >> - } > Hi, Sean Thanks for your comments, let me try to understand it. >This handles the scenario where __get_user_pages_fast() failed and >get_user_pages_unlocked() succeeded, obviously with !write_fault && >writable. ??On a read fault, get_user_pages_unlocked() requests the >page with read permissions to ensure it doesn't incorrectly fail due >to invalid access permissions, i.e. doesn't speculatively request >write permissions for a read-only VMA. ??If that succeeds, we then I think I can understand at this point. Based on my understanding, we set *write* to 1 for __get_user_pages_fast() which means we want a writable page. While in hva_to_pfn_slow(), FOLL_WRITE is not set for get_user_pages_unlocked(), this means a readable page would meet our need. In case a read fault, if __get_user_pages_fast() failed, we still could try get_user_pages_unlocked() to get a readable page. >use __get_user_pages_fast() to test if the VMA is writable. ??If the >VMA is indeed writable, we can create the shadow PTE with write >permissions even though we're handling a read fault. ??Marking the I can't follow up with you from this point. __get_user_pages_fast() iterate the page table and check those requirements. We pass the same parameter both in hva_to_pfn_slow() and hva_to_pfn_fast(). Curious about why we would get different result if no one else change the PTE? BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you mind pointing which part does this? I am lost at this point. >SPTE as writable means we don't take another page fault if/when the >guest writes the page, e.g. in a RMW scenario we take one page fault >instead of two. > Ah, thanks. I guess this is the benefits for this behavior. This reduces one page fault by giving a writable page for a read fault. >> ?? *pfn = page_to_pfn(page); >> ?? return npages; >> ??}
On Mon, 2018-10-08 at 16:11 +0000, kvm-owner@vger.kernel.org wrote: > On Mon, Oct 08, 2018 at 08:21:53AM -0700, Sean Christopherson wrote: > > > > On Mon, 2018-10-08 at 22:41 +0800, kvm-owner@vger.kernel.org wrote: > > > > > > Case (!write_fault && writable) has been handled in hva_to_pfn_fast(), > > > it is not necessary to try again if hva_to_pfn_fast() already failed. > > > > > > This patch removes this case in hva_to_pfn_slow(). > > > > > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > > > > > > --- > > > > > > Hope my understanding is correct. > > > > > > --- > > > ??virt/kvm/kvm_main.c | 10 ---------- > > > ??1 file changed, 10 deletions(-) > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 1f42f1d474b5..c8fb3a9d81fa 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > > > ?? if (npages != 1) > > > ?? return npages; > > > ?? > > > - /* map read fault as writable if possible */ > > > - if (unlikely(!write_fault) && writable) { > > > - struct page *wpage; > > > - > > > - if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) { > > > - *writable = true; > > > - put_page(page); > > > - page = wpage; > > > - } > > > - } > Hi, Sean > > Thanks for your comments, let me try to understand it. > > > > > This handles the scenario where __get_user_pages_fast() failed and > > get_user_pages_unlocked() succeeded, obviously with !write_fault && > > writable. ??On a read fault, get_user_pages_unlocked() requests the > > page with read permissions to ensure it doesn't incorrectly fail due > > to invalid access permissions, i.e. doesn't speculatively request > > write permissions for a read-only VMA. ??If that succeeds, we then > I think I can understand at this point. > > Based on my understanding, we set *write* to 1 for __get_user_pages_fast() > which means we want a writable page. While in hva_to_pfn_slow(), > FOLL_WRITE is not set for get_user_pages_unlocked(), this means a > readable page would meet our need. > > In case a read fault, if __get_user_pages_fast() failed, we still could > try get_user_pages_unlocked() to get a readable page. > > > > > use __get_user_pages_fast() to test if the VMA is writable. ??If the > > VMA is indeed writable, we can create the shadow PTE with write > > permissions even though we're handling a read fault. ??Marking the > I can't follow up with you from this point. > > __get_user_pages_fast() iterate the page table and check those > requirements. We pass the same parameter both in hva_to_pfn_slow() and > hva_to_pfn_fast(). Curious about why we would get different result if no > one else change the PTE? hva_to_pfn_slow() uses the exact access type (read vs write) when calling get_user_pages_unlocked(), i.e. FOLL_WRITE is only set if @write_fault is true, whereas the calls to __get_user_pages_fast() always try to pin the page as writable. > BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you > mind pointing which part does this? I am lost at this point. That was poorly worded. Let me try again: If get_user_pages_unlocked() succeeds and we're handling a read-fault, use __get_user_pages_fast() to check if the page was mapped writable. Even though we only request read permissions, the PTE *may* have write permissions if the corresponding VMA is writable, e.g. a writable VMA will generate a read-only PTE when the page is backed by the zero page using copy-on-write semantics. If the PTE is indeed writable, we can also create the shadow PTE with write permissions even though we're handling a read fault. Marking the SPTE as writable means we don't take another page fault if/when the guest writes the page, e.g. in a RMW scenario we take one page fault instead of two. > > > > SPTE as writable means we don't take another page fault if/when the > > guest writes the page, e.g. in a RMW scenario we take one page fault > > instead of two. > > > Ah, thanks. I guess this is the benefits for this behavior. This reduces > one page fault by giving a writable page for a read fault. > > > > > > > > > ?? *pfn = page_to_pfn(page); > > > ?? return npages; > > > ??}
On Tue, Oct 09, 2018 at 06:57:33AM -0700, Sean Christopherson wrote: >On Mon, 2018-10-08 at 16:11 +0000, kvm-owner@vger.kernel.org wrote: >> Hi, Sean >> >> Thanks for your comments, let me try to understand it. >> >> > >> > This handles the scenario where __get_user_pages_fast() failed and >> > get_user_pages_unlocked() succeeded, obviously with !write_fault && >> > writable. ??On a read fault, get_user_pages_unlocked() requests the >> > page with read permissions to ensure it doesn't incorrectly fail due >> > to invalid access permissions, i.e. doesn't speculatively request >> > write permissions for a read-only VMA. ??If that succeeds, we then >> I think I can understand at this point. >> >> Based on my understanding, we set *write* to 1 for __get_user_pages_fast() >> which means we want a writable page. While in hva_to_pfn_slow(), >> FOLL_WRITE is not set for get_user_pages_unlocked(), this means a >> readable page would meet our need. >> >> In case a read fault, if __get_user_pages_fast() failed, we still could >> try get_user_pages_unlocked() to get a readable page. >> >> > >> > use __get_user_pages_fast() to test if the VMA is writable. ??If the >> > VMA is indeed writable, we can create the shadow PTE with write >> > permissions even though we're handling a read fault. ??Marking the >> I can't follow up with you from this point. >> >> __get_user_pages_fast() iterate the page table and check those >> requirements. We pass the same parameter both in hva_to_pfn_slow() and >> hva_to_pfn_fast(). Curious about why we would get different result if no >> one else change the PTE? > >hva_to_pfn_slow() uses the exact access type (read vs write) when calling >get_user_pages_unlocked(), i.e. FOLL_WRITE is only set if @write_fault is >true, whereas the calls to __get_user_pages_fast() always try to pin the >page as writable. > >> BTW, you mentioned __get_user_pages_fast() would test the VMA? Would you >> mind pointing which part does this? I am lost at this point. Sean, Thanks a lot for your patient explanation. > >That was poorly worded. ??Let me try again: > >If get_user_pages_unlocked() succeeds and we're handling a read-fault, >use __get_user_pages_fast() to check if the page was mapped writable. >Even though we only request read permissions, the PTE *may* have write >permissions if the corresponding VMA is writable, e.g. a writable VMA >will generate a read-only PTE when the page is backed by the zero page >using copy-on-write semantics. To be honest, I got some difficulty to understand this part. The reason is some background knowledge I missed. Below is my guess on the background knowledge: ---------------------------------------------- There are two level/kind RW permission marking mechanism * VMA mapping: encoded in vma->vm_page_prot * PTE mapping: encoded in pteval, e.g. _PAGE_RW And these two permission for the same addr/page could be different. Hmm, if I am right (not understand the detail in the code, and not convienced by myself), * get_user_pages_unlocked() checks VMA mapping * __get_user_pages_fast() checks PTE mapping Based on your last sentence, VMA mapping has a higher priority. A writable VMA could be created through mmap, while its corresponding PTE could be read-only since a copy-on-write semantics applies here. This is the reason why we could retrive a read-only page from PTE mapping point of view, but writable from VMA point of view. And then based on our requirement, give the page write permission. > >If the PTE is indeed writable, we can also create the shadow PTE with >write permissions even though we're handling a read fault.????Marking >the SPTE as writable means we don't take another page fault if/when >the guest writes the page, e.g. in a RMW scenario we take one page >fault instead of two. > This makes sence to me. As you mentioned in the last sentense of previous paragraph, a read-only PTE with writable VMA means the PTE is indeed writalbe. In this case we could set the SPTE as write to reduce another page fault in a RMW scenario. After all I still have one confusion: My confusion is why the 2nd __get_user_pages_fast() would succeed while the 1st failed with same parameters passed. Do we touch the PTE mapping during get_user_pages_unlocked()? Hope this is not too silly. Thanks for your patience a lot :-) >> > >> > SPTE as writable means we don't take another page fault if/when the >> > guest writes the page, e.g. in a RMW scenario we take one page fault >> > instead of two. >> > >> Ah, thanks. I guess this is the benefits for this behavior. This reduces >> one page fault by giving a writable page for a read fault. >> >> > >> > > >> > > ?? *pfn = page_to_pfn(page); >> > > ?? return npages; >> > > ??}
On Tue, 2018-10-09 at 15:32 +0000, richard.weiyang@gmail.com wrote: > > If the PTE is indeed writable, we can also create the shadow PTE with > > write permissions even though we're handling a read fault.????Marking > > the SPTE as writable means we don't take another page fault if/when > > the guest writes the page, e.g. in a RMW scenario we take one page > > fault instead of two. > > > This makes sence to me. > > As you mentioned in the last sentense of previous paragraph, a read-only > PTE with writable VMA means the PTE is indeed writalbe. In this case we > could set the SPTE as write to reduce another page fault in a RMW > scenario. > > After all I still have one confusion: > > My confusion is why the 2nd __get_user_pages_fast() would succeed > while the 1st failed with same parameters passed. Do we touch the PTE > mapping during get_user_pages_unlocked()? Yep, get_user_pages_unlocked() will fault-in the page if necessary.
On Tue, Oct 16, 2018 at 08:08:44AM -0700, Sean Christopherson wrote: >On Tue, 2018-10-09 at 15:32 +0000, richard.weiyang@gmail.com wrote: >> > If the PTE is indeed writable, we can also create the shadow PTE with >> > write permissions even though we're handling a read fault.????Marking >> > the SPTE as writable means we don't take another page fault if/when >> > the guest writes the page, e.g. in a RMW scenario we take one page >> > fault instead of two. >> > >> This makes sence to me. >> >> As you mentioned in the last sentense of previous paragraph, a read-only >> PTE with writable VMA means the PTE is indeed writalbe. In this case we >> could set the SPTE as write to reduce another page fault in a RMW >> scenario. >> >> After all I still have one confusion: >> >> ?? My confusion is why the 2nd __get_user_pages_fast() would succeed >> ?? while the 1st failed with same parameters passed. Do we touch the PTE >> ?? mapping during get_user_pages_unlocked()? > >Yep, get_user_pages_unlocked() will fault-in the page if necessary. Ah, got it. Thanks :-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1f42f1d474b5..c8fb3a9d81fa 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1403,16 +1403,6 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, if (npages != 1) return npages; - /* map read fault as writable if possible */ - if (unlikely(!write_fault) && writable) { - struct page *wpage; - - if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) { - *writable = true; - put_page(page); - page = wpage; - } - } *pfn = page_to_pfn(page); return npages; }
Case (!write_fault && writable) has been handled in hva_to_pfn_fast(), it is not necessary to try again if hva_to_pfn_fast() already failed. This patch removes this case in hva_to_pfn_slow(). Signed-off-by: Wei Yang <richard.weiyang@gmail.com> --- Hope my understanding is correct. --- virt/kvm/kvm_main.c | 10 ---------- 1 file changed, 10 deletions(-)