Message ID | 50444A07.8060102@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Sep 3, 2012 at 1:11 AM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
> On 09/03/2012 10:09 AM, Hugo wrote:
>> On Sun, Sep 2, 2012 at 8:29 AM, Xiao Guangrong
>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>> On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
>>>> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
>>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>>> On 08/31/2012 02:59 AM, Hugo wrote:
>>>>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>>>>> <xiaoguangrong@linux.vnet.ibm.com> wrote:
>>>>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>>>>
>>>>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>>>>> Hello List,
>>>>>>>>>>
>>>>>>>>>> I am a KVM newbie studying the KVM MMU code.
>>>>>>>>>>
>>>>>>>>>> On an existing guest, I am trying to track all guest writes by
>>>>>>>>>> marking the page table entry as read-only in the EPT entry [I am
>>>>>>>>>> using an Intel machine with VMX and EPT support]. EPT support
>>>>>>>>>> appears to re-use the shadow page table (SPT) code and hence some
>>>>>>>>>> of the SPT routines.
>>>>>>>>>>
>>>>>>>>>> I was thinking of the following approach: use pte_list_walk() to
>>>>>>>>>> traverse the list of sptes and use mmu_spte_update() to flip the
>>>>>>>>>> PT_WRITABLE_MASK flag. But the sptes are not all on a single list;
>>>>>>>>>> they are on separate lists (based on gfn, page level, and memory
>>>>>>>>>> slot). So, would recording all the faulted guest GFNs and then
>>>>>>>>>> using the above method work?
>>>>>>>>>
>>>>>>>>> There are two ways to write-protect all sptes:
>>>>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>>>>> - walk the shadow page cache to get the shadow pages at the highest
>>>>>>>>>   level (level = 4 on EPT), then write-protect their entries.
>>>>>>>>>
>>>>>>>>> If you just want to do it for a specified gfn, you can use
>>>>>>>>> rmap_write_protect().
>>>>>>>>>
>>>>>>>>> Just inquisitive, what is your purpose? :)
>>>>>>>>
>>>>>>>> Hi, Guangrong,
>>>>>>>>
>>>>>>>> I have done similar things to what Sunil did, simply for study
>>>>>>>> purposes. However, I have found some very weird situations.
>>>>>>>> Basically, in the guest VM I allocate a chunk of memory (the size of
>>>>>>>> a page) in a user-level program. Through a guest kernel module and a
>>>>>>>> self-defined hypercall, I pass the gva of this memory to KVM. Then I
>>>>>>>> try different methods in the hypercall handler to write-protect this
>>>>>>>> page of memory. Note that I want to write-protect it through the EPT
>>>>>>>> instead of write-protecting it in the guest page tables.
>>>>>>>>
>>>>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa.
>>>>>>>> Starting from kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I
>>>>>>>> change the code to read sptep (the pointer to the spte) instead of
>>>>>>>> the spte, so I can modify the spte corresponding to this gpa. What I
>>>>>>>> observe is that I can successfully modify spte[0] (which I think is
>>>>>>>> the lowest-level page table entry of the EPT table; the change is
>>>>>>>> reflected when I call kvm_mmu_get_spte_hierarchy again), but my
>>>>>>>> user-level program in the VM can still write to this page.
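As a concrete illustration of the first of the two ways Xiao lists above, a
minimal sketch under the 3.x-era KVM internals discussed in this thread might
look like the following. The wrapper name kvm_write_protect_all_slots() is
invented for this sketch; kvm_for_each_memslot(), kvm_mmu_slot_remove_write_access()
and kvm_flush_remote_tlbs() are the existing helpers of that era, whose callers
held mmu_lock:

	#include <linux/kvm_host.h>

	/*
	 * Sketch only: write-protect every spte by walking all memslots.
	 * The wrapper name is hypothetical; the helpers are the ones named
	 * in the thread.
	 */
	static void kvm_write_protect_all_slots(struct kvm *kvm)
	{
		struct kvm_memory_slot *memslot;

		spin_lock(&kvm->mmu_lock);
		kvm_for_each_memslot(memslot, kvm_memslots(kvm))
			kvm_mmu_slot_remove_write_access(kvm, memslot->id);
		/* Make sure no vCPU keeps a stale writable translation cached. */
		kvm_flush_remote_tlbs(kvm);
		spin_unlock(&kvm->mmu_lock);
	}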
>>>>>>>>
>>>>>>>> In your post above, you mentioned "the shadow pages in the highest
>>>>>>>> level (level = 4 on EPT)", and I don't understand this part. Does it
>>>>>>>> mean I have to modify spte[3] instead of spte[0]? I tried modifying
>>>>>>>> spte[1] and spte[3]; both cause a vmexit. So I am totally confused
>>>>>>>> about the meaning of "level" in the shadow page table and how the
>>>>>>>> levels relate to the shadow pages. Can you help me understand this?
>>>>>>>>
>>>>>>>> 2. As suggested in this post, I also used rmap_write_protect() to
>>>>>>>> write-protect the page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>>>>>>> spte[4]) I can still see that spte[0] comes back as something like
>>>>>>>> xxxxxx005, which means the function was called successfully. But I
>>>>>>>> can still write to the page.
>>>>>>>>
>>>>>>>> I even tried kvm_age_hva() to remove the spte; this gives me 0 for
>>>>>>>> spte[0], but I can still write to the page. So I am further confused
>>>>>>>> about the levels used in the shadow pages.
>>>>>>>
>>>>>>> kvm_mmu_get_spte_hierarchy() gets sptes outside of the mmu-lock; you
>>>>>>> can hold spin_lock(&vcpu->kvm->mmu_lock) and use
>>>>>>> for_each_shadow_entry() instead. And, after the change, did you flush
>>>>>>> all TLBs?
>>>>>>
>>>>>> I do take the lock in my code and I do flush the TLB.
>>>>>>
>>>>>>>
>>>>>>> If it does not work, please post your code.
>>>>>>>
>>>>>>
>>>>>> Here is my code. The modifications are made in x86/x86.c;
>>>>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>>>>
>>>>>> Method 1:
>>>>>>
>>>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> ................
>>>>>>
>>>>>> 	case KVM_HC_HL_EPTPER:
>>>>>> 		/* This method is not working. */
>>>>>> 		localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>>> 		if (localGpa == UNMAPPED_GVA) {
>>>>>> 			printk("read is not correct\n");
>>>>>> 			return -KVM_ENOSYS;
>>>>>> 		}
>>>>>>
>>>>>> 		hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>>>> 		hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>>>>>
>>>>>> 		printk("after changes return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n",
>>>>>> 		       hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>>>> 		       hl_sptes[2], hl_sptes[3]);
>>>>>> 		kvm_flush_remote_tlbs(vcpu->kvm);
>>>>>> ...................
>>>>>> }
>>>>>>
>>>>>> The function hl_kvm_mmu_update_spte is defined as:
>>>>>>
>>>>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>>>>> {
>>>>>> 	struct kvm_shadow_walk_iterator iterator;
>>>>>> 	int nr_sptes = 0;
>>>>>> 	u64 sptes[4];
>>>>>> 	u64 *sptep[4];
>>>>>> 	u64 localMask = 0xFFFFFFFFFFFFFFF8; /* ...1000: clears bits 0-2 */
>>>>>>
>>>>>> 	spin_lock(&vcpu->kvm->mmu_lock);
>>>>>> 	for_each_shadow_entry(vcpu, addr, iterator) {
>>>>>> 		sptes[iterator.level - 1] = *iterator.sptep;
>>>>>> 		sptep[iterator.level - 1] = iterator.sptep;
>>>>>> 		nr_sptes++;
>>>>>> 		if (!is_shadow_present_pte(*iterator.sptep))
>>>>>> 			break;
>>>>>> 	}
>>>>>>
>>>>>> 	sptes[0] = sptes[0] & localMask;
>>>>>> 	sptes[0] = sptes[0] | mask;
>>>>>> 	__set_spte(sptep[0], sptes[0]);
>>>>>> 	//update_spte(sptep[0], sptes[0]);
>>>>>> 	/*
>>>>>> 	sptes[1] = sptes[1] & localMask;
>>>>>> 	sptes[1] = sptes[1] | mask;
>>>>>> 	update_spte(sptep[1], sptes[1]);
>>>>>> 	*/
>>>>>> 	/*
>>>>>> 	sptes[3] = sptes[3] & localMask;
>>>>>> 	sptes[3] = sptes[3] | mask;
>>>>>> 	update_spte(sptep[3], sptes[3]);
>>>>>> 	*/
>>>>>> 	spin_unlock(&vcpu->kvm->mmu_lock);
>>>>>>
>>>>>> 	return nr_sptes;
>>>>>> }
>>>>>>
>>>>>> The execution results, from kern.log:
>>>>>>
>>>>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>>>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>>>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>>>>
>>>>>> I find that if I write to this page, the write-protect permission bit
>>>>>> is actually set back to writable. I am not quite sure why.
>>>>>>
>>>>>> Method 2:
>>>>>>
>>>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> ................
>>>>>>
>>>>>> 	case KVM_HC_HL_EPTPER:
>>>>>> 		/* This method is not working either. */
>>>>>> 		localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>>> 		localGfn = gpa_to_gfn(localGpa);
>>>>>>
>>>>>> 		spin_lock(&vcpu->kvm->mmu_lock);
>>>>>> 		hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>>>> 		printk("local gfn is %llx , result of kvm_age_hva is %d\n",
>>>>>> 		       localGfn, hl_result);
>>>>>> 		kvm_flush_remote_tlbs(vcpu->kvm);
>>>>>> 		spin_unlock(&vcpu->kvm->mmu_lock);
>>>>>>
>>>>>> 		hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>>>>> 		printk("return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n",
>>>>>> 		       hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>>>> 		       hl_sptes[2], hl_sptes[3]);
>>>>>> ...................
>>>>>> }
>>>>>>
>>>>>> The execution results are:
>>>>>>
>>>>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>>>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>>>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>>>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>>>>
>>>>>> My feeling is that I have to modify something else, not just the spte.
>>>>>
>>>>> Aha.
>>>>>
>>>>> There are two issues I found:
>>>>>
>>>>> - you should use kvm_mmu_gva_to_gpa_write instead of
>>>>>   kvm_mmu_gva_to_gpa_read, since if the page in the guest is read-only,
>>>>>   a write will trigger COW and switch to a new page
>>>>>
>>>>> - you also need to do some work on the page fault path to avoid setting
>>>>>   the W bit on the spte
>>>>
>>>> Thanks for the quick reply.
>>>>
>>>> BTW, I am using the KVM 2.6.32.27 kernel module and virt-manager to
>>>> manage the guest. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>>>>
>>>> I have changed to use the kvm_mmu_gva_to_gpa_write function.
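To spell out why the read/write distinction above matters:
kvm_mmu_gva_to_gpa_read() walks the guest page tables asking only for read
permission, so it can succeed on a page the guest still maps read-only for
copy-on-write; the guest's first write then relocates the data to a new frame,
and the spte that was patched no longer backs the address. A minimal sketch of
the hypercall-handler prologue using the write variant (a fragment; the
variable names gva and ex are illustrative):

	struct x86_exception ex;
	gpa_t gpa;

	/*
	 * Translate with write access so that a still-unresolved
	 * copy-on-write mapping in the guest fails the translation here,
	 * instead of letting us write-protect a frame the guest is about
	 * to replace.
	 */
	gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, &ex);
	if (gpa == UNMAPPED_GVA)
		return -KVM_ENOSYS;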
>>>>
>>>> I am also putting extra printk messages into the page_fault,
>>>> tdp_page_fault, and inject_page_fault functions; none of them gives
>>>
>>> Could you show these changes please?
>>
>> What I did in tdp_page_fault and inject_page_fault is simple: I added the
>> same piece of code at the beginning of each function. The target_gpa is
>> set in x86/x86.c by the vmcall handler:
>>
>> /////
>> if (gpa == target_gpa) {
>> 	printk("XXXX Debug %llx \n", gpa);
>> }
>> /////
>>
>> This way, no crazy kernel logs are made.
>>>
>>>> me any information if I write to the memory whose spte was changed to
>>>> read-only. I also tried to trace when __set_spte is called after I
>>>
>>> Try to add some debug messages in mmu_spte_set and mmu_spte_update.
>>>
>>>> modify the spte. I still have not had any luck, so I really want to
>>>> know where the problem is. As Davidlohr mentions, this is a basic
>>>> technique that I found in many papers, which is why I used it as a
>>>> study case.
>>>
>>> You'd better show what you did in the guest OS.
>>
>> What I did in the guest OS includes two parts:
>>
>> Kernel level: a pseudo device driver with read and write functions. The
>> write function accepts the virtual address defined in a user program and
>> then passes this virtual address to KVM through vmcall. This is the basic
>> device driver module introduced in Linux Device Drivers.
>>
>> User level: I allocate a page of memory in the program's address space:
>>
>> pagesize = sysconf(_SC_PAGE_SIZE);
>> if (pagesize == -1) {
>> 	printf("sysconf error\n");
>> 	return -1;
>> }
>> //buffer = (char*)memalign(pagesize, pagesize);
>> ori = (char*)malloc(1024 + pagesize - 1);
>> if (ori == NULL) {
>> 	printf("memalign\n");
>> 	return -1;
>> }
>> buffer = (char *)(((int) ori + pagesize - 1) & ~(pagesize - 1));
>> address = (unsigned long) buffer;
>>
>> Then I pass the "address" to the kernel module:
>>
>> size = write(fd, &address, sizeof(unsigned long));
>
> Okay, I have written a test case; it works fine on my box. If it works on
> your machine, you can easily find out what is wrong in your code; if not,
> please let me know.
>
> The code is attached.

Thanks very much for your effort. The code works with no problems, and it
even works with my own user-level programs. Actually, what I wrote in the
hypercall handler is the same as yours. The difference is in tdp_page_fault:
I used the gpa to catch the event, but you are using the gfn. I think it is
probably because the gpa seen in tdp_page_fault contains the page offset,
while the gpa returned by kvm_mmu_gva_to_gpa ends with 000 (it is actually
gfn << PAGE_SHIFT). But anyway, the code is finally working. Many thanks.

To end this topic, I actually have another finding. I used another method,
which does not modify the EPT at all; it was mentioned in a CCS paper. What
I did is find the hva of that gva from the guest and use the Linux kernel
API to write-protect that hva. In this situation, the KVM mmu notifier
invalidates the corresponding spte entirely, and writing to that memory in
the guest causes a segmentation fault in kvm. So I think that for the same
virtual address in the guest OS there are two sets of paging tables: the
EPT, which lets the guest run without vmexits on memory access, and the
host page tables used by KVM. Which one to modify depends entirely on what
you want to do. I find that the code comments around the KVM mmu notifier
kind of confirm my understanding.

Thanks again, it was nice talking to the list.
Hugo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
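The host-side counterpart of Hugo's last finding can be sketched as follows,
again assuming the 3.x-era KVM helpers gfn_to_hva() and kvm_is_error_hva().
How the hva is then write-protected (the Linux kernel API Hugo alludes to)
is not shown; the point is that protection applied to the host page tables
propagates to the spte through the KVM mmu notifier:

	unsigned long hva;
	gfn_t gfn = gpa_to_gfn(gpa);

	/*
	 * Map the guest frame number to the host virtual address that
	 * backs it in the VMM process's address space.
	 */
	hva = gfn_to_hva(vcpu->kvm, gfn);
	if (kvm_is_error_hva(hva))
		return -KVM_EFAULT;

	/*
	 * Write-protecting this hva in the host page tables causes the KVM
	 * mmu notifier to invalidate the spte mapping it, so the next guest
	 * write faults back into KVM instead of hitting a writable EPT
	 * entry, matching the behaviour Hugo describes above.
	 */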
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c9b6e33..3e32f9b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1200,7 +1200,7 @@ void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	}
 }
 
-static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
+bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long *rmapp;
@@ -3296,6 +3296,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	return false;
 }
 
+gfn_t filter_gfn;
+
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 			  bool prefault)
 {
@@ -3311,6 +3313,11 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	ASSERT(vcpu);
 	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
 
+	if (filter_gfn && (filter_gfn == gpa_to_gfn(gpa))) {
+		printk("Catch gfn %llx.\n", filter_gfn);
+		return 1;
+	}
+
 	if (unlikely(error_code & PFERR_RSVD_MASK))
 		return handle_mmio_page_fault(vcpu, gpa, error_code, true);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d44edaa..d3e266c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1759,6 +1759,24 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 			return 1;
 		vcpu->arch.osvw.status = data;
 		break;
+	case 0x99999999: {
+		extern bool rmap_write_protect(struct kvm *kvm, u64 gfn);
+		extern gfn_t filter_gfn;
+
+		gpa_t gpa = kvm_mmu_gva_to_gpa_write(vcpu, data, NULL);
+		if (gpa == UNMAPPED_GVA) {
+			printk("unmapped gva:%llx.\n", data);
+		}
+
+		printk("GVA %llx -> GPA:%llx.\n", data, gpa);
+		filter_gfn = gpa_to_gfn(gpa);
+		spin_lock(&vcpu->kvm->mmu_lock);
+		if (rmap_write_protect(vcpu->kvm, filter_gfn))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		spin_unlock(&vcpu->kvm->mmu_lock);
+	}
+		break;
+
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
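Only the host-side diff is attached; the guest half is left implicit. A
plausible guest-side trigger (an assumption, since Xiao did not post this
part) is a minimal guest kernel module that performs a single WRMSR to the
otherwise-unused MSR 0x99999999 intercepted by the new case in
kvm_set_msr_common():

	#include <linux/module.h>
	#include <linux/gfp.h>
	#include <asm/msr.h>

	static unsigned long addr;

	/*
	 * Hypothetical guest half of the test case above: write the target
	 * address to MSR 0x99999999. The WRMSR exits to the host, which
	 * translates the GVA and write-protects the backing gfn. Without
	 * the host patch applied, this WRMSR would instead raise #GP.
	 */
	static int __init wp_test_init(void)
	{
		addr = __get_free_page(GFP_KERNEL);
		if (!addr)
			return -ENOMEM;

		wrmsrl(0x99999999, (u64)addr);

		/*
		 * This write should now be caught by the filter_gfn check
		 * in tdp_page_fault() and show up as "Catch gfn ..." in
		 * the host kernel log.
		 */
		*(volatile char *)addr = 1;
		return 0;
	}

	static void __exit wp_test_exit(void)
	{
		free_page(addr);
	}

	module_init(wp_test_init);
	module_exit(wp_test_exit);
	MODULE_LICENSE("GPL");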