Message ID | 1377866497-3866-1-git-send-email-pbonzini@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote: > Page tables in a read-only memory slot will currently cause a triple > fault because the page walker uses gfn_to_hva and it fails on such a slot. > > OVMF uses such a page table; however, real hardware seems to be fine with > that as long as the accessed/dirty bits are set. Save whether the slot > is readonly, and later check it when updating the accessed and dirty bits. > The fix looks OK to me, but some comment below. > Cc: stable@vger.kernel.org > Cc: gleb@redhat.com > Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > CCing to stable@ since the regression was introduced with > support for readonly memory slots. > > arch/x86/kvm/paging_tmpl.h | 7 ++++++- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 14 +++++++++----- > 3 files changed, 16 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index 0433301..dadc5c0 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -99,6 +99,7 @@ struct guest_walker { > pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; > gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; > pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; > + bool pte_writable[PT_MAX_FULL_LEVELS]; > unsigned pt_access; > unsigned pte_access; > gfn_t gfn; > @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, > if (pte == orig_pte) > continue; > > + if (unlikely(!walker->pte_writable[level - 1])) > + return -EACCES; > + > ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); > if (ret) > return ret; > @@ -309,7 +313,8 @@ retry_walk: > goto error; > real_gfn = gpa_to_gfn(real_gfn); > > - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); > + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, > + &walker->pte_writable[walker->level - 1]); The use of gfn_to_hva_read is misleading. The code can still write into gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva() to gfn_to_hva_write(). This makes me think are there other places where gfn_to_hva() was used, but gfn_to_hva_prot() should have been? - kvm_host_page_size() looks incorrect. We never use huge page to map read only memory slots currently. - kvm_handle_bad_page() also looks incorrect and may cause incorrect address to be reported to userspace. - kvm_setup_async_pf() also incorrect. Makes all page fault on read only slot to be sync. - kvm_vm_fault() one looks OK since function assumes write only slots, but it is obsolete and should be deleted anyway. Others in generic and x86 code looks OK, somebody need to check ppc and arm code. > if (unlikely(kvm_is_error_hva(host_addr))) > goto error; > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ca645a0..22f9cdf 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages, > > struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); > unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); > +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable); > unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn); > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index f7e4334..418d037 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) > EXPORT_SYMBOL_GPL(gfn_to_hva); > > /* > - * The hva returned by this function is only allowed to be read. > - * It should pair with kvm_read_hva() or kvm_read_hva_atomic(). > + * If writable is set to false, the hva returned by this function is only > + * allowed to be read. > */ > -static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn) > +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable) > { > + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn); > + if (writable) > + *writable = !memslot_is_readonly(slot); > + > return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false); > } > > @@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, > int r; > unsigned long addr; > > - addr = gfn_to_hva_read(kvm, gfn); > + addr = gfn_to_hva_read(kvm, gfn, NULL); > if (kvm_is_error_hva(addr)) > return -EFAULT; > r = kvm_read_hva(data, (void __user *)addr + offset, len); > @@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data, > gfn_t gfn = gpa >> PAGE_SHIFT; > int offset = offset_in_page(gpa); > > - addr = gfn_to_hva_read(kvm, gfn); > + addr = gfn_to_hva_read(kvm, gfn, NULL); > if (kvm_is_error_hva(addr)) > return -EFAULT; > pagefault_disable(); > -- > 1.8.3.1 -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/30/2013 08:41 PM, Paolo Bonzini wrote: > Page tables in a read-only memory slot will currently cause a triple > fault because the page walker uses gfn_to_hva and it fails on such a slot. > > OVMF uses such a page table; however, real hardware seems to be fine with > that as long as the accessed/dirty bits are set. Save whether the slot > is readonly, and later check it when updating the accessed and dirty bits. Paolo, do you know why OVMF is using readonly memory like this? AFAIK, The fault trigged by this kind of access can hardly be fixed by userspace since the fault is trigged by pagetable walking not by the current instruction. Do you have any idea to let uerspace emulate it properly? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: > On 08/30/2013 08:41 PM, Paolo Bonzini wrote: > > Page tables in a read-only memory slot will currently cause a triple > > fault because the page walker uses gfn_to_hva and it fails on such a slot. > > > > OVMF uses such a page table; however, real hardware seems to be fine with > > that as long as the accessed/dirty bits are set. Save whether the slot > > is readonly, and later check it when updating the accessed and dirty bits. > > Paolo, do you know why OVMF is using readonly memory like this? > Just a guess, but perhaps they want to move to paging mode as early as possible even before memory controller is fully initialized. > AFAIK, The fault trigged by this kind of access can hardly be fixed by > userspace since the fault is trigged by pagetable walking not by the current > instruction. Do you have any idea to let uerspace emulate it properly? Not sure what userspace you mean here, but there shouldn't be a fault in the first place if ROM page tables have access/dirty bit set and they do. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/01/2013 05:17 PM, Gleb Natapov wrote: > On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote: >> Page tables in a read-only memory slot will currently cause a triple >> fault because the page walker uses gfn_to_hva and it fails on such a slot. >> >> OVMF uses such a page table; however, real hardware seems to be fine with >> that as long as the accessed/dirty bits are set. Save whether the slot >> is readonly, and later check it when updating the accessed and dirty bits. >> > The fix looks OK to me, but some comment below. > >> Cc: stable@vger.kernel.org >> Cc: gleb@redhat.com >> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> CCing to stable@ since the regression was introduced with >> support for readonly memory slots. >> >> arch/x86/kvm/paging_tmpl.h | 7 ++++++- >> include/linux/kvm_host.h | 1 + >> virt/kvm/kvm_main.c | 14 +++++++++----- >> 3 files changed, 16 insertions(+), 6 deletions(-) >> >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h >> index 0433301..dadc5c0 100644 >> --- a/arch/x86/kvm/paging_tmpl.h >> +++ b/arch/x86/kvm/paging_tmpl.h >> @@ -99,6 +99,7 @@ struct guest_walker { >> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; >> gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; >> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; >> + bool pte_writable[PT_MAX_FULL_LEVELS]; >> unsigned pt_access; >> unsigned pte_access; >> gfn_t gfn; >> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, >> if (pte == orig_pte) >> continue; >> >> + if (unlikely(!walker->pte_writable[level - 1])) >> + return -EACCES; >> + >> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); >> if (ret) >> return ret; >> @@ -309,7 +313,8 @@ retry_walk: >> goto error; >> real_gfn = gpa_to_gfn(real_gfn); >> >> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); >> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, >> + &walker->pte_writable[walker->level - 1]); > The use of gfn_to_hva_read is misleading. The code can still write into > gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva() > to gfn_to_hva_write(). Yes. I agreed. > > This makes me think are there other places where gfn_to_hva() was > used, but gfn_to_hva_prot() should have been? > - kvm_host_page_size() looks incorrect. We never use huge page to map > read only memory slots currently. It only checks whether gfn have been mapped, I think we can use gfn_to_hva_read() instead, the real permission will be checked when we translate the gfn to pfn. > - kvm_handle_bad_page() also looks incorrect and may cause incorrect > address to be reported to userspace. I have no idea on this point. kvm_handle_bad_page() is called when it failed to translate the target gfn to pfn, then the emulator can detect the error on target gfn properly. no? Or i misunderstood your meaning? > - kvm_setup_async_pf() also incorrect. Makes all page fault on read > only slot to be sync. > - kvm_vm_fault() one looks OK since function assumes write only slots, > but it is obsolete and should be deleted anyway. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote: > On 09/01/2013 05:17 PM, Gleb Natapov wrote: > > On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote: > >> Page tables in a read-only memory slot will currently cause a triple > >> fault because the page walker uses gfn_to_hva and it fails on such a slot. > >> > >> OVMF uses such a page table; however, real hardware seems to be fine with > >> that as long as the accessed/dirty bits are set. Save whether the slot > >> is readonly, and later check it when updating the accessed and dirty bits. > >> > > The fix looks OK to me, but some comment below. > > > >> Cc: stable@vger.kernel.org > >> Cc: gleb@redhat.com > >> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > >> --- > >> CCing to stable@ since the regression was introduced with > >> support for readonly memory slots. > >> > >> arch/x86/kvm/paging_tmpl.h | 7 ++++++- > >> include/linux/kvm_host.h | 1 + > >> virt/kvm/kvm_main.c | 14 +++++++++----- > >> 3 files changed, 16 insertions(+), 6 deletions(-) > >> > >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > >> index 0433301..dadc5c0 100644 > >> --- a/arch/x86/kvm/paging_tmpl.h > >> +++ b/arch/x86/kvm/paging_tmpl.h > >> @@ -99,6 +99,7 @@ struct guest_walker { > >> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; > >> gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; > >> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; > >> + bool pte_writable[PT_MAX_FULL_LEVELS]; > >> unsigned pt_access; > >> unsigned pte_access; > >> gfn_t gfn; > >> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, > >> if (pte == orig_pte) > >> continue; > >> > >> + if (unlikely(!walker->pte_writable[level - 1])) > >> + return -EACCES; > >> + > >> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); > >> if (ret) > >> return ret; > >> @@ -309,7 +313,8 @@ retry_walk: > >> goto error; > >> real_gfn = gpa_to_gfn(real_gfn); > >> > >> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); > >> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, > >> + &walker->pte_writable[walker->level - 1]); > > The use of gfn_to_hva_read is misleading. The code can still write into > > gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva() > > to gfn_to_hva_write(). > > Yes. I agreed. > > > > > This makes me think are there other places where gfn_to_hva() was > > used, but gfn_to_hva_prot() should have been? > > - kvm_host_page_size() looks incorrect. We never use huge page to map > > read only memory slots currently. > > It only checks whether gfn have been mapped, I think we can use > gfn_to_hva_read() instead, the real permission will be checked when we translate > the gfn to pfn. > Yes, all the cases I listed should be changed to use function that looks at both regular and RO slots. > > - kvm_handle_bad_page() also looks incorrect and may cause incorrect > > address to be reported to userspace. > > I have no idea on this point. kvm_handle_bad_page() is called when it failed to > translate the target gfn to pfn, then the emulator can detect the error on target gfn > properly. no? Or i misunderstood your meaning? > I am talking about the following code: if (pfn == KVM_PFN_ERR_HWPOISON) { kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current); return 0; } pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need to report the liner address of the faulty memory to a userspace here, but if gfn is in a RO slot gfn_to_hva() will not return correct address here. > > - kvm_setup_async_pf() also incorrect. Makes all page fault on read > > only slot to be sync. > > - kvm_vm_fault() one looks OK since function assumes write only slots, > > but it is obsolete and should be deleted anyway. > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/02/2013 05:25 PM, Gleb Natapov wrote: > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: >>> Page tables in a read-only memory slot will currently cause a triple >>> fault because the page walker uses gfn_to_hva and it fails on such a slot. >>> >>> OVMF uses such a page table; however, real hardware seems to be fine with >>> that as long as the accessed/dirty bits are set. Save whether the slot >>> is readonly, and later check it when updating the accessed and dirty bits. >> >> Paolo, do you know why OVMF is using readonly memory like this? >> > Just a guess, but perhaps they want to move to paging mode as early as > possible even before memory controller is fully initialized. > >> AFAIK, The fault trigged by this kind of access can hardly be fixed by >> userspace since the fault is trigged by pagetable walking not by the current >> instruction. Do you have any idea to let uerspace emulate it properly? > Not sure what userspace you mean here, but there shouldn't be a fault in the I just wonder how to fix this kind of fault. The current patch returns -EACCES but that will crash the guest. I think we'd better let userspace to fix this error (let userspace set the D/A bit.) > first place if ROM page tables have access/dirty bit set and they do. Yes, so we can not call x86_emulate_instruction() to fix this fault (that function emulates the access on the first place). Need directly return a MMIO-exit to userpsace when met this fault? What happen if this fault on pagetable-walking is trigged in x86_emulate_instruction().? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/02/2013 05:49 PM, Gleb Natapov wrote: > On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote: >> On 09/01/2013 05:17 PM, Gleb Natapov wrote: >>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote: >>>> Page tables in a read-only memory slot will currently cause a triple >>>> fault because the page walker uses gfn_to_hva and it fails on such a slot. >>>> >>>> OVMF uses such a page table; however, real hardware seems to be fine with >>>> that as long as the accessed/dirty bits are set. Save whether the slot >>>> is readonly, and later check it when updating the accessed and dirty bits. >>>> >>> The fix looks OK to me, but some comment below. >>> >>>> Cc: stable@vger.kernel.org >>>> Cc: gleb@redhat.com >>>> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> >>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >>>> --- >>>> CCing to stable@ since the regression was introduced with >>>> support for readonly memory slots. >>>> >>>> arch/x86/kvm/paging_tmpl.h | 7 ++++++- >>>> include/linux/kvm_host.h | 1 + >>>> virt/kvm/kvm_main.c | 14 +++++++++----- >>>> 3 files changed, 16 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h >>>> index 0433301..dadc5c0 100644 >>>> --- a/arch/x86/kvm/paging_tmpl.h >>>> +++ b/arch/x86/kvm/paging_tmpl.h >>>> @@ -99,6 +99,7 @@ struct guest_walker { >>>> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; >>>> gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; >>>> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; >>>> + bool pte_writable[PT_MAX_FULL_LEVELS]; >>>> unsigned pt_access; >>>> unsigned pte_access; >>>> gfn_t gfn; >>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, >>>> if (pte == orig_pte) >>>> continue; >>>> >>>> + if (unlikely(!walker->pte_writable[level - 1])) >>>> + return -EACCES; >>>> + >>>> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); >>>> if (ret) >>>> return ret; >>>> @@ -309,7 +313,8 @@ retry_walk: >>>> goto error; >>>> real_gfn = gpa_to_gfn(real_gfn); >>>> >>>> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); >>>> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, >>>> + &walker->pte_writable[walker->level - 1]); >>> The use of gfn_to_hva_read is misleading. The code can still write into >>> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva() >>> to gfn_to_hva_write(). >> >> Yes. I agreed. >> >>> >>> This makes me think are there other places where gfn_to_hva() was >>> used, but gfn_to_hva_prot() should have been? >>> - kvm_host_page_size() looks incorrect. We never use huge page to map >>> read only memory slots currently. >> >> It only checks whether gfn have been mapped, I think we can use >> gfn_to_hva_read() instead, the real permission will be checked when we translate >> the gfn to pfn. >> > Yes, all the cases I listed should be changed to use function that looks > at both regular and RO slots. > >>> - kvm_handle_bad_page() also looks incorrect and may cause incorrect >>> address to be reported to userspace. >> >> I have no idea on this point. kvm_handle_bad_page() is called when it failed to >> translate the target gfn to pfn, then the emulator can detect the error on target gfn >> properly. no? Or i misunderstood your meaning? >> > I am talking about the following code: > > if (pfn == KVM_PFN_ERR_HWPOISON) { > kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current); > return 0; > } > > pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need > to report the liner address of the faulty memory to a userspace here, > but if gfn is in a RO slot gfn_to_hva() will not return correct address > here. Got it, thanks for your explanation. BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote: > On 09/02/2013 05:25 PM, Gleb Natapov wrote: > > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: > >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: > >>> Page tables in a read-only memory slot will currently cause a triple > >>> fault because the page walker uses gfn_to_hva and it fails on such a slot. > >>> > >>> OVMF uses such a page table; however, real hardware seems to be fine with > >>> that as long as the accessed/dirty bits are set. Save whether the slot > >>> is readonly, and later check it when updating the accessed and dirty bits. > >> > >> Paolo, do you know why OVMF is using readonly memory like this? > >> > > Just a guess, but perhaps they want to move to paging mode as early as > > possible even before memory controller is fully initialized. > > > >> AFAIK, The fault trigged by this kind of access can hardly be fixed by > >> userspace since the fault is trigged by pagetable walking not by the current > >> instruction. Do you have any idea to let uerspace emulate it properly? > > Not sure what userspace you mean here, but there shouldn't be a fault in the > > I just wonder how to fix this kind of fault. The current patch returns -EACCES > but that will crash the guest. I think we'd better let userspace to fix this > error (let userspace set the D/A bit.) > Ugh, this is not good. Missed that. Don't know what real HW will do here, but the easy thing for us to do would be to just return success. > > first place if ROM page tables have access/dirty bit set and they do. > > Yes, so we can not call x86_emulate_instruction() to fix this fault (that function > emulates the access on the first place). Need directly return a MMIO-exit to > userpsace when met this fault? What happen if this fault on pagetable-walking > is trigged in x86_emulate_instruction().? I think we should not return MMIO-exit to userspace. Either ignore write attempt or kill a guest. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 06:05:10PM +0800, Xiao Guangrong wrote: > On 09/02/2013 05:49 PM, Gleb Natapov wrote: > > On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote: > >> On 09/01/2013 05:17 PM, Gleb Natapov wrote: > >>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote: > >>>> Page tables in a read-only memory slot will currently cause a triple > >>>> fault because the page walker uses gfn_to_hva and it fails on such a slot. > >>>> > >>>> OVMF uses such a page table; however, real hardware seems to be fine with > >>>> that as long as the accessed/dirty bits are set. Save whether the slot > >>>> is readonly, and later check it when updating the accessed and dirty bits. > >>>> > >>> The fix looks OK to me, but some comment below. > >>> > >>>> Cc: stable@vger.kernel.org > >>>> Cc: gleb@redhat.com > >>>> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > >>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > >>>> --- > >>>> CCing to stable@ since the regression was introduced with > >>>> support for readonly memory slots. > >>>> > >>>> arch/x86/kvm/paging_tmpl.h | 7 ++++++- > >>>> include/linux/kvm_host.h | 1 + > >>>> virt/kvm/kvm_main.c | 14 +++++++++----- > >>>> 3 files changed, 16 insertions(+), 6 deletions(-) > >>>> > >>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > >>>> index 0433301..dadc5c0 100644 > >>>> --- a/arch/x86/kvm/paging_tmpl.h > >>>> +++ b/arch/x86/kvm/paging_tmpl.h > >>>> @@ -99,6 +99,7 @@ struct guest_walker { > >>>> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; > >>>> gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; > >>>> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; > >>>> + bool pte_writable[PT_MAX_FULL_LEVELS]; > >>>> unsigned pt_access; > >>>> unsigned pte_access; > >>>> gfn_t gfn; > >>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, > >>>> if (pte == orig_pte) > >>>> continue; > >>>> > >>>> + if (unlikely(!walker->pte_writable[level - 1])) > >>>> + return -EACCES; > >>>> + > >>>> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); > >>>> if (ret) > >>>> return ret; > >>>> @@ -309,7 +313,8 @@ retry_walk: > >>>> goto error; > >>>> real_gfn = gpa_to_gfn(real_gfn); > >>>> > >>>> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); > >>>> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, > >>>> + &walker->pte_writable[walker->level - 1]); > >>> The use of gfn_to_hva_read is misleading. The code can still write into > >>> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva() > >>> to gfn_to_hva_write(). > >> > >> Yes. I agreed. > >> > >>> > >>> This makes me think are there other places where gfn_to_hva() was > >>> used, but gfn_to_hva_prot() should have been? > >>> - kvm_host_page_size() looks incorrect. We never use huge page to map > >>> read only memory slots currently. > >> > >> It only checks whether gfn have been mapped, I think we can use > >> gfn_to_hva_read() instead, the real permission will be checked when we translate > >> the gfn to pfn. > >> > > Yes, all the cases I listed should be changed to use function that looks > > at both regular and RO slots. > > > >>> - kvm_handle_bad_page() also looks incorrect and may cause incorrect > >>> address to be reported to userspace. > >> > >> I have no idea on this point. kvm_handle_bad_page() is called when it failed to > >> translate the target gfn to pfn, then the emulator can detect the error on target gfn > >> properly. no? Or i misunderstood your meaning? > >> > > I am talking about the following code: > > > > if (pfn == KVM_PFN_ERR_HWPOISON) { > > kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current); > > return 0; > > } > > > > pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need > > to report the liner address of the faulty memory to a userspace here, > > but if gfn is in a RO slot gfn_to_hva() will not return correct address > > here. > > Got it, thanks for your explanation. > > BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :) I am busy with reviews mostly :). If you are not to busy with lockless write protection then fine with me. Lest wait for Paolo's input on proposed API though. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 02/09/2013 12:07, Gleb Natapov ha scritto: > On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote: >> On 09/02/2013 05:25 PM, Gleb Natapov wrote: >>> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: >>>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: >>>>> Page tables in a read-only memory slot will currently cause a triple >>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot. >>>>> >>>>> OVMF uses such a page table; however, real hardware seems to be fine with >>>>> that as long as the accessed/dirty bits are set. Save whether the slot >>>>> is readonly, and later check it when updating the accessed and dirty bits. >>>> >>>> Paolo, do you know why OVMF is using readonly memory like this? >>>> >>> Just a guess, but perhaps they want to move to paging mode as early as >>> possible even before memory controller is fully initialized. >>> >>>> AFAIK, The fault trigged by this kind of access can hardly be fixed by >>>> userspace since the fault is trigged by pagetable walking not by the current >>>> instruction. Do you have any idea to let uerspace emulate it properly? >>> Not sure what userspace you mean here, but there shouldn't be a fault in the >> >> I just wonder how to fix this kind of fault. The current patch returns -EACCES >> but that will crash the guest. I think we'd better let userspace to fix this >> error (let userspace set the D/A bit.) >> > Ugh, this is not good. Missed that. Don't know what real HW will do > here, but the easy thing for us to do would be to just return success. Real hardware would just do a memory write. What happens depends on what is on the bus, i.e. on what the ROM is used for. QEMU uses read-only slots for two things: actual read-only memory where writes go to the bitbucket, and "ROMD" memory where writes are treated as MMIO. So, in the first case we would ignore the write. In the second we would do an MMIO exit to userspace. But ignoring the write isn't always correct, and doing an MMIO exit is complicated, so I would just kill the guest. EPT will probably write to the read-only slots without thinking much about it. My patch injects a page fault, which is very likely to escalate to a triple fault. This is probably never what you want---on the other hand, I wasn't sure what level of accuracy we want in this case, given that EPT does it wrong too. Paolo >>> first place if ROM page tables have access/dirty bit set and they do. >> >> Yes, so we can not call x86_emulate_instruction() to fix this fault (that function >> emulates the access on the first place). Need directly return a MMIO-exit to >> userpsace when met this fault? What happen if this fault on pagetable-walking >> is trigged in x86_emulate_instruction().? > I think we should not return MMIO-exit to userspace. Either ignore write attempt > or kill a guest. > > -- > Gleb. > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 02/09/2013 12:11, Gleb Natapov ha scritto: >> > >> > Got it, thanks for your explanation. >> > >> > BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :) > I am busy with reviews mostly :). If you are not to busy with lockless > write protection then fine with me. Lest wait for Paolo's input on > proposed API though. No problem, I'm happy to do this kind of exercise. I like the API you proposed. I can do this change on top of the API change. However, since this bug is "in the wild" it may be applicable to stable@ as well. If you agree, I will also post v2 of this patch for stable kernels as soon as we agree on the desired faulting behavior. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 02/09/2013 11:25, Gleb Natapov ha scritto: > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: >>> Page tables in a read-only memory slot will currently cause a triple >>> fault because the page walker uses gfn_to_hva and it fails on such a slot. >>> >>> OVMF uses such a page table; however, real hardware seems to be fine with >>> that as long as the accessed/dirty bits are set. Save whether the slot >>> is readonly, and later check it when updating the accessed and dirty bits. >> >> Paolo, do you know why OVMF is using readonly memory like this? >> > Just a guess, but perhaps they want to move to paging mode as early as > possible even before memory controller is fully initialized. More precisely they want to move to 64-bit mode as early as possible, and that requires paging. >> AFAIK, The fault trigged by this kind of access can hardly be fixed by >> userspace since the fault is trigged by pagetable walking not by the current >> instruction. Do you have any idea to let uerspace emulate it properly? > > Not sure what userspace you mean here, but there shouldn't be a fault in the > first place if ROM page tables have access/dirty bit set and they do. Yep. Actually they do not set dirty on the PD/PDP/PML4, only on the bottom level (which is also fine). Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 06:00:08PM +0200, Paolo Bonzini wrote: > Il 02/09/2013 11:25, Gleb Natapov ha scritto: > > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: > >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: > >>> Page tables in a read-only memory slot will currently cause a triple > >>> fault because the page walker uses gfn_to_hva and it fails on such a slot. > >>> > >>> OVMF uses such a page table; however, real hardware seems to be fine with > >>> that as long as the accessed/dirty bits are set. Save whether the slot > >>> is readonly, and later check it when updating the accessed and dirty bits. > >> > >> Paolo, do you know why OVMF is using readonly memory like this? > >> > > Just a guess, but perhaps they want to move to paging mode as early as > > possible even before memory controller is fully initialized. > > More precisely they want to move to 64-bit mode as early as possible, > and that requires paging. > > >> AFAIK, The fault trigged by this kind of access can hardly be fixed by > >> userspace since the fault is trigged by pagetable walking not by the current > >> instruction. Do you have any idea to let uerspace emulate it properly? > > > > Not sure what userspace you mean here, but there shouldn't be a fault in the > > first place if ROM page tables have access/dirty bit set and they do. > > Yep. Actually they do not set dirty on the PD/PDP/PML4, only on the > bottom level (which is also fine). > They do it for a good reason, dirty bit exists only on the bottom level :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 05:56:07PM +0200, Paolo Bonzini wrote: > Il 02/09/2013 12:07, Gleb Natapov ha scritto: > > On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote: > >> On 09/02/2013 05:25 PM, Gleb Natapov wrote: > >>> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote: > >>>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote: > >>>>> Page tables in a read-only memory slot will currently cause a triple > >>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot. > >>>>> > >>>>> OVMF uses such a page table; however, real hardware seems to be fine with > >>>>> that as long as the accessed/dirty bits are set. Save whether the slot > >>>>> is readonly, and later check it when updating the accessed and dirty bits. > >>>> > >>>> Paolo, do you know why OVMF is using readonly memory like this? > >>>> > >>> Just a guess, but perhaps they want to move to paging mode as early as > >>> possible even before memory controller is fully initialized. > >>> > >>>> AFAIK, The fault trigged by this kind of access can hardly be fixed by > >>>> userspace since the fault is trigged by pagetable walking not by the current > >>>> instruction. Do you have any idea to let uerspace emulate it properly? > >>> Not sure what userspace you mean here, but there shouldn't be a fault in the > >> > >> I just wonder how to fix this kind of fault. The current patch returns -EACCES > >> but that will crash the guest. I think we'd better let userspace to fix this > >> error (let userspace set the D/A bit.) > >> > > Ugh, this is not good. Missed that. Don't know what real HW will do > > here, but the easy thing for us to do would be to just return success. > > Real hardware would just do a memory write. What happens depends on > what is on the bus, i.e. on what the ROM is used for. > > QEMU uses read-only slots for two things: actual read-only memory where > writes go to the bitbucket, and "ROMD" memory where writes are treated > as MMIO. > > So, in the first case we would ignore the write. In the second we would > do an MMIO exit to userspace. But ignoring the write isn't always > correct, and doing an MMIO exit is complicated, so I would just kill the > guest. > > EPT will probably write to the read-only slots without thinking much > about it. > > My patch injects a page fault, which is very likely to escalate to a > triple fault. This is probably never what you want---on the other hand, > I wasn't sure what level of accuracy we want in this case, given that > EPT does it wrong too. > Just ignoring write into RO slot is OK. We do not handle page tables in MMIO memory, so handling page tables in ROMD memory shouldn't be a priority either. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 02, 2013 at 05:58:41PM +0200, Paolo Bonzini wrote: > Il 02/09/2013 12:11, Gleb Natapov ha scritto: > >> > > >> > Got it, thanks for your explanation. > >> > > >> > BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :) > > I am busy with reviews mostly :). If you are not to busy with lockless > > write protection then fine with me. Lest wait for Paolo's input on > > proposed API though. > > No problem, I'm happy to do this kind of exercise. I like the API you > proposed. I can do this change on top of the API change. However, > since this bug is "in the wild" it may be applicable to stable@ as well. > If you agree, I will also post v2 of this patch for stable kernels as > soon as we agree on the desired faulting behavior. > I am OK with having minimal patch for backporting to stable and doing API changes + fixes for other call sites on top. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Il 01/09/2013 11:17, Gleb Natapov ha scritto: > This makes me think are there other places where gfn_to_hva() was > used, but gfn_to_hva_prot() should have been? > - kvm_host_page_size() looks incorrect. We never use huge page to map > read only memory slots currently. > - kvm_handle_bad_page() also looks incorrect and may cause incorrect > address to be reported to userspace. > - kvm_setup_async_pf() also incorrect. Makes all page fault on read > only slot to be sync. > - kvm_vm_fault() one looks OK since function assumes write only slots, > but it is obsolete and should be deleted anyway. > > Others in generic and x86 code looks OK, somebody need to check ppc and > arm code. PPC is a mess, for s390 I couldn't find a spec. ARM needs to be fixed. Patches coming... Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 0433301..dadc5c0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -99,6 +99,7 @@ struct guest_walker { pt_element_t prefetch_ptes[PTE_PREFETCH_NUM]; gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; + bool pte_writable[PT_MAX_FULL_LEVELS]; unsigned pt_access; unsigned pte_access; gfn_t gfn; @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, if (pte == orig_pte) continue; + if (unlikely(!walker->pte_writable[level - 1])) + return -EACCES; + ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte); if (ret) return ret; @@ -309,7 +313,8 @@ retry_walk: goto error; real_gfn = gpa_to_gfn(real_gfn); - host_addr = gfn_to_hva(vcpu->kvm, real_gfn); + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn, + &walker->pte_writable[walker->level - 1]); if (unlikely(kvm_is_error_hva(host_addr))) goto error; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ca645a0..22f9cdf 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages, struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable); unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn); void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f7e4334..418d037 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) EXPORT_SYMBOL_GPL(gfn_to_hva); /* - * The hva returned by this function is only allowed to be read. - * It should pair with kvm_read_hva() or kvm_read_hva_atomic(). + * If writable is set to false, the hva returned by this function is only + * allowed to be read. */ -static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn) +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable) { + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn); + if (writable) + *writable = !memslot_is_readonly(slot); + return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false); } @@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int r; unsigned long addr; - addr = gfn_to_hva_read(kvm, gfn); + addr = gfn_to_hva_read(kvm, gfn, NULL); if (kvm_is_error_hva(addr)) return -EFAULT; r = kvm_read_hva(data, (void __user *)addr + offset, len); @@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data, gfn_t gfn = gpa >> PAGE_SHIFT; int offset = offset_in_page(gpa); - addr = gfn_to_hva_read(kvm, gfn); + addr = gfn_to_hva_read(kvm, gfn, NULL); if (kvm_is_error_hva(addr)) return -EFAULT; pagefault_disable();
Page tables in a read-only memory slot will currently cause a triple fault because the page walker uses gfn_to_hva and it fails on such a slot. OVMF uses such a page table; however, real hardware seems to be fine with that as long as the accessed/dirty bits are set. Save whether the slot is readonly, and later check it when updating the accessed and dirty bits. Cc: stable@vger.kernel.org Cc: gleb@redhat.com Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- CCing to stable@ since the regression was introduced with support for readonly memory slots. arch/x86/kvm/paging_tmpl.h | 7 ++++++- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 14 +++++++++----- 3 files changed, 16 insertions(+), 6 deletions(-)