Message ID: 20170119172941.65642-5-roger.pau@citrix.com (mailing list archive)
State: New, archived
On 19/01/17 17:29, Roger Pau Monne wrote:
> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);

Please can you make a helper for this and dedup it with shadow_enable()?
Something like:

    void write_pse_identmap(uint32_t *l2)

rather than duplicating this particular piece of magic.  (It can
probably even be static inline.)

> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
> +
> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);

Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
pages during construction?  (I can't actually recall.  Perhaps it is
better to explicitly clear it.)

~Andrew
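[A minimal sketch of the helper being asked for here; the name matches
the write_32bit_pse_identmap() call that shows up in Roger's revised
code later in the thread, while its exact home (e.g. a page-table
header) is an assumption:]

    /* Fill the page at l2 with a 32-bit PSE (4MB superpage) identity map. */
    static inline void write_32bit_pse_identmap(uint32_t *l2)
    {
        unsigned int i;

        for ( i = 0; i < PAGE_SIZE / sizeof(*l2); i++ )
            l2[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
                     _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
    }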
>>> On 20.01.17 at 20:41, <andrew.cooper3@citrix.com> wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
>> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>> +                         &mfn, &p2mt, 0, &rc);
>> +    if ( tss == NULL )
>> +    {
>> +        printk("Unable to map VM86 TSS area\n");
>> +        return 0;
>> +    }
>> +
>> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
>
> Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
> pages during construction?  (I can't actually recall.  Perhaps it is
> better to explicitly clear it.)

No, we don't zero before handing out; we zero after reclaiming memory
from a dying guest or from the hypervisor.

Jan
>
> +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> +                                       unsigned long nr_pages, bool map)
> +{
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)

This can be taken outside the loop.

-boris

> +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> +            break;
> +        }
> +        nr_pages -= rc;
> +        pfn += rc;
> +        process_pending_softirqs();
> +    }
> +
> +    return rc;
> +}
> +
On Mon, Jan 23, 2017 at 09:11:06AM -0500, Boris Ostrovsky wrote:
>> +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
>> +                                       unsigned long nr_pages, bool map)
>> +{
>> +    int rc;
>> +
>> +    for ( ; ; )
>> +    {
>> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>
> This can be taken outside the loop.

Maybe I can instead make map const, and the compiler should optimize
this itself? I find it a little cumbersome to store function pointers,
ie:

    int (*mapf)(struct domain *, gfn_t, unsigned long, mfn_t) = ...;

Roger.
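[For comparison, a sketch of the shape Boris is asking for, hoisting
the map/unmap selection out of the loop; the mapf name is the one from
Roger's reply, and the body otherwise mirrors the quoted patch:]

    static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
                                           unsigned long nr_pages, bool map)
    {
        /* Pick the operation once instead of re-evaluating it per iteration. */
        int (*mapf)(struct domain *, gfn_t, unsigned long, mfn_t) =
            map ? map_mmio_regions : unmap_mmio_regions;
        int rc;

        for ( ; ; )
        {
            rc = mapf(d, _gfn(pfn), nr_pages, _mfn(pfn));
            if ( rc == 0 )
                break;
            if ( rc < 0 )
            {
                printk(XENLOG_WARNING
                       "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
                       map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
                break;
            }
            nr_pages -= rc;
            pfn += rc;
            process_pending_softirqs();
        }

        return rc;
    }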
>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
>  static long __initdata dom0_max_nrpages = LONG_MAX;
>
> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> +#define HVM_VM86_TSS_SIZE 128

I continue to be puzzled by this value. Why 128? I think this really
needs to be clarified in the comment.

> @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
>          avail -= max_pdx >> s;
>      }
>
> -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> +    need_paging = opt_dom0_shadow ||
> +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> +                                                   !paging_mode_hap(d)));

What is the !paging_mode_hap() part good for? It's being taken care
of by checking opt_dom0_shadow already, isn't it? Alternatively, to
make the distinction more obvious, I'd suggest

    need_paging = has_hvm_container_domain(d)
                  ? !iommu_hap_pt_share || !paging_mode_hap(d)
                  : opt_dom0_shadow;

> @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>              continue;
>          }
>
> -        *entry_guest = *entry;
> -        pages = PFN_UP(entry_guest->size);
> +        /*
> +         * Make sure the start and length are aligned to PAGE_SIZE, because
> +         * that's the minimum granularity of the 2nd stage translation. Since
> +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
> +         * order to prevent this code from getting out of sync.
> +         */
> +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);

You definitely don't need to use _AC() in C code. But the whole thing
can anyway simply be

    start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);

(albeit I'd like to note that if anything we'd have to be prepared
for page sizes > 4k, not smaller ones, and the whole idea of
PAGE_ORDER_4K breaks in that case).

> +        end = (entry->addr + entry->size) &
> +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );

On top of the above, please remove the stray blank from near the end
of this statement.

> +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> +                                paddr_t limit, paddr_t *addr)
> +{
> +    unsigned int i = d->arch.nr_e820;
> +
> +    while ( i-- )
> +    {
> +        struct e820entry *entry = &d->arch.e820[i];
> +
> +        if ( entry->type != E820_RAM || entry->size < size )
> +            continue;
> +
> +        /* Subtract from the beginning. */
> +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> +        {
> +            *addr = entry->addr;
> +            entry->addr += size;
> +            entry->size -= size;

The comment says so, but why from the beginning? Wouldn't it be
better to steal from the end of the highest range below 4Gb, to
keep an overall more conventional layout?

> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

This comes too late - the page table setup above also requires page
alignment (and with that, adding PAGE_SIZE would not break the
alignment requirement). Even more, the code below doesn't strictly
require page alignment, it only requires the range to not cross a
page boundary.

> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> +    unmap_domain_page(tss);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
> +
> +    return 0;

While I've seen the code a number of times by now, I still can't help
disliking the early success return (accompanied by an error message).
I think this not being a mistake would be more obvious with

    if ( tss )
    {
    }
    else
        printk();
    return 0;

> +static int __init pvh_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    unsigned int i;
> +    int rc;
> +    bool preempted;
> +#define MB1_PAGES PFN_DOWN(MB(1))
> +
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> +
> +    pvh_setup_e820(d, nr_pages);
> +    do {
> +        preempted = false;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Memory below 1MB is identity mapped.
> +     * NB: this only makes sense when booted from legacy BIOS.
> +     */
> +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);

MB1_PAGES

> +    if ( rc )
> +    {
> +        printk("Failed to identity map low 1MB: %d\n", rc);
> +        return rc;
> +    }
> +
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        unsigned long addr, size;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        addr = PFN_DOWN(d->arch.e820[i].addr);
> +        size = PFN_DOWN(d->arch.e820[i].size);
> +
> +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> +
> +        if ( addr >= MB1_PAGES )
> +            rc = pvh_populate_memory_range(d, addr, size);
> +        else
> +            pvh_steal_low_ram(d, addr, size);

Would you mind shortening the ASSERT() expression above by moving it
into the else branch here?

Jan
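[A worked example of the alignment computation Jan suggests, assuming
the x86 values PAGE_ORDER_4K == 0 and PAGE_SIZE == 4096, applied to a
typical first e820 RAM entry that ends at the EBDA:]

    /*
     * entry->addr = 0x0, entry->size = 0x9fc00:
     *   start = ROUNDUP(0x0, 0x1000)            = 0x0
     *   end   = (0x0 + 0x9fc00) & ~(0x1000 - 1) = 0x9f000
     * The trailing 0xc00 bytes (less than one 4k page) are dropped,
     * since the p2m cannot map a fraction of a page.
     */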
At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE 128
>
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

I was asked on IRC to do some archaeology / explain myself about this,
so here goes.

First, the _intended_ mechanism for "real mode" guests on older VMX
hardware is to run them in virtual 8086 mode inside the guest as much
as possible, and emulate whenever we can't do that.

This is managed with some state in v->arch.hvm_vmx:
 - vmx_realmode, set when the guest thinks it's in real mode.
 - vmx_emulate, to force emulation rather than VMENTER.
   We set this when we have exceptions to inject, as the VMX hardware
   would try to inject them in 32-bit protected mode.
 - vm86_segment_mask, a bitmask of segments that can't be fudged to
   run in virtual 8086 mode.

When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
the guest in virtual 8086 mode.

The reason we need a TSS at all is for handling software interrupts.
Virtual 8086 mode has two ways to handle software interrupts: stay in
virtual 8086 mode and vector via the table @0x0, or raise #GP in
32-bit protected mode. We want the first of those, so that a guest in
'real mode' can make BIOS calls. The CPU uses a bitmap in the TSS to
decide which method to use; we need all the bits in that bitmap to be
clear. In my SDM (April 2016) this is section 20.3.3 "Class 3 --
Software Interrupt Handling in Virtual-8086 Mode", table 20-2,
method 5.

---

So far so good, and AIUI the system works -- or at least it did in
December 2008 when it was put in (8d4638d1), because emulating every
instruction made Windows boot times so slow that we would definitely
have noticed.

But looking at it now, I'm not convinced of exactly how. The magic
bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
base address itself lives at offset 100. A zero'd TSS should mean an
I/O map at 0, and an interrupt redirection bitmap at -32, which would
plausibly work if the TSS were 256 bytes (matching the limit set in
Xen). Perhaps it's only working because the 128 bytes following the
TSS in hvmloader happen to be zeros too?

I also don't remember why the TSS is 128 rather than 104 bytes. The
SDM claims that the TSS must be larger than 104 bytes "when accessing
the I/O permission bit map or interrupt redirection bit map"
(7.2.2 "TSS Descriptor"), but I suspect that just means that the
generated address of the bitmap must lie inside the limit.

In any case, the limit set in vmx_set_segment_register() should surely
match the size of the actual TSS!

I haven't got the time or hardware to test this right now, but could
maybe look at it next week unless anyone else wants to play with it.

Cheers,

Tim.
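[To make the geometry Tim describes easier to follow, a sketch of the
offsets involved, as given in the SDM sections he cites - illustration,
not Xen code:]

    /*
     * 32-bit TSS as used for virtual 8086 mode:
     *
     *   offset 0 .. 103 : architectural TSS fields
     *   offset 100      : 16-bit I/O map base address (call it B)
     *   B - 32 .. B - 1 : interrupt redirection bitmap (256 bits)
     *   B .. limit      : I/O permission bitmap
     *
     * With a fully zeroed TSS, B == 0, so the interrupt redirection
     * bitmap is generated at TSS base - 32, i.e. before the TSS
     * itself - the puzzle discussed in the following messages.
     */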
On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE 128
>
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

Given the recent comments by Tim, and that this is starting to look
like a can of worms, I would like to leave this as-is for the moment,
on the grounds that it's what hvmloader does (I'm not saying it's
right), and that this issue should be treated independently from this
patch series.

Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a
PVHv2 Dom0. IIRC I've tried that before (without unrestricted mode
support) and it was working fine.

> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
> >          avail -= max_pdx >> s;
> >      }
> >
> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> > +    need_paging = opt_dom0_shadow ||
> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> > +                                                   !paging_mode_hap(d)));
>
> What is the !paging_mode_hap() part good for? It's being taken care
> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
> make the distinction more obvious, I'd suggest
>
>     need_paging = has_hvm_container_domain(d)
>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>                   : opt_dom0_shadow;

AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no
EPT, but with an IOMMU? Does that exist? In that case opt_dom0_shadow
won't be set, but paging_mode_hap would be false. Maybe that's just an
impossible combination in any case...

> > +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
>
> You definitely don't need to use _AC() in C code. But the whole thing
> can anyway simply be
>
>     start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
>
> (albeit I'd like to note that if anything we'd have to be prepared
> for page sizes > 4k, not smaller ones, and the whole idea of
> PAGE_ORDER_4K breaks in that case).

Thanks, I will change as per your recommendation above, although I'm
not sure what to do with the PAGE_ORDER_4K thing. Are you fine with
leaving it like you suggest?

> > +        end = (entry->addr + entry->size) &
> > +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );
>
> On top of the above, please remove the stray blank from near the end
> of this statement.

I've changed that to:

    end = (entry->addr + entry->size) & ~((PAGE_SIZE << PAGE_ORDER_4K) - 1);

in order to match the above.

> > +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> > +                                paddr_t limit, paddr_t *addr)
> > +{
> > +    unsigned int i = d->arch.nr_e820;
> > +
> > +    while ( i-- )
> > +    {
> > +        struct e820entry *entry = &d->arch.e820[i];
> > +
> > +        if ( entry->type != E820_RAM || entry->size < size )
> > +            continue;
> > +
> > +        /* Subtract from the beginning. */
> > +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> > +        {
> > +            *addr = entry->addr;
> > +            entry->addr += size;
> > +            entry->size -= size;
>
> The comment says so, but why from the beginning? Wouldn't it be
> better to steal from the end of the highest range below 4Gb, to
> keep an overall more conventional layout?

That sounds sensible, let me change it to:

    /* Subtract from the end. */
    if ( entry->addr + entry->size + size <= limit && entry->addr >= MB(1) )
    {
        entry->size -= size;
        *addr = entry->addr + entry->size;
        return 0;
    }

This is going to involve some changes in pvh_setup_vmx_realmode_helpers,
see below.

> > +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> > +{
> > [...]
> > +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> > +    gaddr += PAGE_SIZE;
> > +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
>
> This comes too late - the page table setup above also requires page
> alignment (and with that, adding PAGE_SIZE would not break the
> alignment requirement). Even more, the code below doesn't strictly
> require page alignment, it only requires the range to not cross a
> page boundary.

Given the change that you requested in pvh_steal_ram, the start of the
memory area returned by it is not going to be page-aligned, so I will
have to perform the TSS setup first, and then the identity page tables.

> While I've seen the code a number of times by now, I still can't
> help disliking the early success return (accompanied by an error
> message). I think this not being a mistake would be more obvious
> with
>
>     if ( tss )
>     {
>     }
>     else
>         printk();
>     return 0;

That's not a problem, I will change it given that I will also have to
move this before the setup of the identity page tables.

> > +static int __init pvh_setup_p2m(struct domain *d)
> > +{
> > [...]
> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
>
> MB1_PAGES
>
> > [...]
> > +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> > +
> > +        if ( addr >= MB1_PAGES )
> > +            rc = pvh_populate_memory_range(d, addr, size);
> > +        else
> > +            pvh_steal_low_ram(d, addr, size);
>
> Would you mind shortening the ASSERT() expression above by moving it
> into the else branch here?

Fixed both of the above, thanks.

Roger.
On Fri, Jan 27, 2017 at 11:14:10AM +0000, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > > +#define HVM_VM86_TSS_SIZE 128
> >
> > I continue to be puzzled by this value. Why 128? I think this really
> > needs to be clarified in the comment.
>
> I was asked on IRC to do some archaeology / explain myself about this,
> so here goes.
>
> [...]
>
> But looking at it now, I'm not convinced of exactly how. The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> base address itself lives at offset 100. A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen). Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

Right, so *if* this was working as intended, the interrupt bitmap
would be at HVM_PARAM_VM86_TSS - 32, which we don't guarantee to zero
at all.

I've also looked at the manual, and it states that the last byte of
the IO bitmap should be filled with 1s [0], which we don't do at all.
Also, what's the expected size of the IO bitmap, 64KB?

Roger.

[0] Vol3, section 20.3.3 "Class 3-Software Interrupt Handling in
Virtual-8086 Mode", Figure 20-5.
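[On the size question: the I/O permission bitmap has one bit per port,
so covering all 65536 ports takes 8192 bytes (8KB, not 64KB), plus the
trailing all-ones byte the SDM requires. The 128 zeroed bytes provided
by hvmloader cover ports 0-1023, i.e. exactly the ISA range:]

    #define IO_BITMAP_FULL_BYTES (65536 / 8)  /* 8192 bytes for all ports */
    #define IO_BITMAP_ISA_BYTES  (1024 / 8)   /* 128 bytes, ports 0-0x3ff */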
On 27/01/17 11:14, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> [...]
> This is managed with some state in v->arch.hvm_vmx:
>  - vmx_realmode, set when the guest thinks it's in real mode.
>  - vmx_emulate, to force emulation rather than VMENTER.
>    We set this when we have exceptions to inject, as the VMX hardware
>    would try to inject them in 32-bit protected mode.
>  - vm86_segment_mask, a bitmask of segments that can't be fudged to
>    run in virtual 8086 mode.
>
> When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
> emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
> otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
> the guest in virtual 8086 mode.

Ah - this is where I went wrong. I'd logically combined
vmx_enter_realmode and vmx_realmode when reading the assembly.

> But looking at it now, I'm not convinced of exactly how. The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> base address itself lives at offset 100. A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen). Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

With an IO_base_map of 0, the software interrupt bitmap will end up
being ahead of the TSS, not after it. I would not be surprised if it
turns out that microcode doesn't range-check against TSS.base.

> I also don't remember why the TSS is 128 rather than 104 bytes. The
> SDM claims that the TSS must be larger than 104 bytes "when accessing
> the I/O permission bit map or interrupt redirection bit map"
> (7.2.2 "TSS Descriptor"), but I suspect that just means that the
> generated address of the bitmap must lie inside the limit.

The documented way of expressing "no IO bitmap" is to set the map base
to a value which exceeds the TSS limit. All this means (I think) is
that you must make a larger-than-default TSS if you want to use an IO
or software interrupt bitmap.

> In any case, the limit set in vmx_set_segment_register() should surely
> match the size of the actual TSS.
>
> I haven't got the time or hardware to test this right now, but could
> maybe look at it next week unless anyone else wants to play with it.

I have hardware. I will look into it when I have a moment, unless
anyone beats me to it.

~Andrew
Hi,

At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> On 27/01/17 11:14, Tim Deegan wrote:
> > But looking at it now, I'm not convinced of exactly how. The magic
> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> > base address itself lives at offset 100. A zero'd TSS should mean an
> > I/O map at 0, and an interrupt redirection bitmap at -32, which
> > would plausibly work if the TSS were 256 bytes (matching the limit
> > set in Xen). Perhaps it's only working because the 128 bytes
> > following the TSS in hvmloader happen to be zeros too?
>
> With an IO_base_map of 0, the software interrupt bitmap will end up
> being ahead of the TSS, not after it.

I should have thought that the segmented address calculation would
wrap and leave us at TSS + 224.

> > I also don't remember why the TSS is 128 rather than 104 bytes. The
> > SDM claims that the TSS must be larger than 104 bytes "when
> > accessing the I/O permission bit map or interrupt redirection bit
> > map" (7.2.2 "TSS Descriptor"), but I suspect that just means that
> > the generated address of the bitmap must lie inside the limit.
>
> The documented way of expressing "no IO bitmap" is to set the map base
> to a value which exceeds the TSS limit. All this means (I think) is
> that you must make a larger-than-default TSS if you want to use an IO
> or software interrupt bitmap.

Yes, I wonder about the I/O bitmap too. We don't provide one, or even
enough space for a full one, but the current SDM is pretty clear that
the CPU will try to check it in virtual 8086 mode.

It may be that all the ports actually used happen to fall in the 128
bytes of zeros that we provide.

Or possibly (both for this and the interrupt bitmap) we are causing
#GP and somehow ending up exiting-and-emulating. But I don't see quite
what the path is for that.

Cheers,

Tim.
On 27/01/17 13:20, Tim Deegan wrote:
> Hi,
>
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>>> But looking at it now, I'm not convinced of exactly how. [...]
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

As far as I am aware, this is the only case of a system descriptor
access which could end up negative (relative to base). All IDT/GDT/LDT
accesses are sensibly bounded by the validity of their trigger
conditions.

I'd expect microcode to calculate TSS.base + I/O base - 32 +
bit_of(vector), on the expectation that an OS actually wanting this to
work would have set it up properly.

The actual behaviour can be determined by putting the TSS on a page
boundary, making the previous frame non-readable via EPT, and seeing
whether an EPT violation occurs. (I haven't yet got far enough in my
nested virt work for this to be an easy thing to configure, but it is
possible by manually clobbering unrestricted mode on a newer processor
and using HAP.)

>>> I also don't remember why the TSS is 128 rather than 104 bytes. [...]
>> The documented way of expressing "no IO bitmap" is to set the map
>> base to a value which exceeds the TSS limit. All this means (I think)
>> is that you must make a larger-than-default TSS if you want to use an
>> IO or software interrupt bitmap.
> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
>
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

With an offset of 0, we actually provide 256 bytes of zeros in the
bitmap within the TSS limit.

> Or possibly (both for this and the interrupt bitmap) we are causing
> #GP and somehow ending up exiting-and-emulating. But I don't see
> quite what the path is for that.

We set IOPL to 3 as well when entering vm86 to fake up real mode. This
bypasses all I/O bitmap checks (a property common to ring 3 protected
tasks as well - see specifically 20.2.7 "Sensitive Instructions"),
which means the IN/OUT instructions end up directly at the relevant
vmexit case.

~Andrew
Hi,

At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> The actual behaviour can be determined by putting the TSS on a page
> boundary, making the previous frame non-readable via EPT, and seeing
> whether an EPT violation occurs.

Indeed. Or likewise with normal pagetables.

> > Yes, I wonder about the I/O bitmap too. We don't provide one, or
> > even enough space for a full one, but the current SDM is pretty
> > clear that the CPU will try to check it in virtual 8086 mode.
> >
> > It may be that all the ports actually used happen to fall in the
> > 128 bytes of zeros that we provide.
>
> With an offset of 0, we actually provide 256 bytes of zeros in the
> bitmap within the TSS limit.

Sure, or at least 128 bytes of zeros and another 128 bytes of
something.

> > Or possibly (both for this and the interrupt bitmap) we are causing
> > #GP and somehow ending up exiting-and-emulating. But I don't see
> > quite what the path is for that.
>
> We set IOPL to 3 as well when entering vm86 to fake up real mode.
> This bypasses all I/O bitmap checks (a property common to ring 3
> protected tasks as well - see specifically 20.2.7 "Sensitive
> Instructions"), which means the IN/OUT instructions end up directly
> at the relevant vmexit case.

20.2.8.1 makes it clear that this is not the case -- in virtual 8086
mode all IN/OUT ops check the bitmap even with IOPL == CPL.

Tim.
On 27/01/17 14:01, Tim Deegan wrote:
> Hi,
>
> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> The actual behaviour can be determined by putting the TSS on a page
>> boundary, making the previous frame non-readable via EPT, and seeing
>> whether an EPT violation occurs.
> Indeed. Or likewise with normal pagetables.
>
>>> Yes, I wonder about the I/O bitmap too. [...]
>> With an offset of 0, we actually provide 256 bytes of zeros in the
>> bitmap within the TSS limit.
> Sure, or at least 128 bytes of zeros and another 128 bytes of
> something.

That is a good point. Nothing prevents a guest exiting vm86 mode and
using a task switch to move to a new TSS, which will cause Xen to
write state back into the vm86_tss, making it no longer a zeroed block
of memory.

Despite being owned by the guest, this TSS is actually managed by Xen.
It should be initialised to defaults each time Xen needs to use it on
behalf of the guest.

>>> Or possibly (both for this and the interrupt bitmap) we are causing
>>> #GP and somehow ending up exiting-and-emulating. But I don't see
>>> quite what the path is for that.
>> We set IOPL to 3 as well when entering vm86 to fake up real mode.
>> This bypasses all I/O bitmap checks (a property common to ring 3
>> protected tasks as well - see specifically 20.2.7 "Sensitive
>> Instructions"), which means the IN/OUT instructions end up directly
>> at the relevant vmexit case.
> 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> mode all IN/OUT ops check the bitmap even with IOPL == CPL.

Hmm. Right you are, which explains why the TSS limit is greater than
0x67.

If the emulation code were working correctly, the emulator should come
to the same conclusion as hardware and inject a #GP fault. I suspect
it is more likely that RomBIOS doesn't use a port higher than we have
bitmap space for.

~Andrew
>>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> > +#define HVM_VM86_TSS_SIZE 128
>>
>> I continue to be puzzled by this value. Why 128? I think this really
>> needs to be clarified in the comment.
>
> Given the recent comments by Tim, and that this is starting to look
> like a can of worms, I would like to leave this as-is for the moment,
> on the grounds that it's what hvmloader does (I'm not saying it's
> right), and that this issue should be treated independently from this
> patch series.

Well, for the purpose of this patch it would be sufficient if the
comment referred to hvmloader. But then I think I saw you set the TSS
limit to 0x67, which is neither in line with the value above nor -
according to what Tim said (but I didn't check myself yet) - the 255
used in hvmloader. I.e. if you clone hvmloader code, all aspects of it
should match.

> Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a
> PVHv2 Dom0. IIRC I've tried that before (without unrestricted mode
> support) and it was working fine.

Now if that's the case, then why bother with the TSS?

>> What is the !paging_mode_hap() part good for? It's being taken care
>> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
>> make the distinction more obvious, I'd suggest
>>
>>     need_paging = has_hvm_container_domain(d)
>>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>>                   : opt_dom0_shadow;
>
> AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no
> EPT, but with an IOMMU? Does that exist? In that case opt_dom0_shadow
> won't be set, but paging_mode_hap would be false. Maybe that's just
> an impossible combination in any case...

At least when running Xen itself virtualized, I wouldn't dare to
assume this is an impossible combination. However, I can't see how
that case would be handled any different by the original or the
suggested replacement expressions: need_paging would get set either
way afaict.

>> You definitely don't need to use _AC() in C code. But the whole
>> thing can anyway simply be
>>
>>     start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
>>
>> (albeit I'd like to note that if anything we'd have to be prepared
>> for page sizes > 4k, not smaller ones, and the whole idea of
>> PAGE_ORDER_4K breaks in that case).
>
> Thanks, I will change as per your recommendation above, although I'm
> not sure what to do with the PAGE_ORDER_4K thing. Are you fine with
> leaving it like you suggest?

Yes, there's far more broken code in that case, and hence the remark
was in parentheses in an attempt to make clear it's really just a
remark.

>> This comes too late - the page table setup above also requires page
>> alignment (and with that, adding PAGE_SIZE would not break the
>> alignment requirement). Even more, the code below doesn't strictly
>> require page alignment, it only requires the range to not cross a
>> page boundary.
>
> Given the change that you requested in pvh_steal_ram, the start of
> the memory area returned by it is not going to be page-aligned, so I
> will have to perform the TSS setup first, and then the identity page
> tables.

Or simply pass the required alignment.

Jan
On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
> >>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> > Given the recent comments by Tim, and that this is starting to look
> > like a can of worms, I would like to leave this as-is for the
> > moment, on the grounds that it's what hvmloader does (I'm not
> > saying it's right), and that this issue should be treated
> > independently from this patch series.
>
> Well, for the purpose of this patch it would be sufficient if the
> comment referred to hvmloader. But then I think I saw you set the TSS
> limit to 0x67, which is neither in line with the value above nor

Hm, no, I'm not setting the limit anywhere here, this is done in
vmx_set_segment_register, and it's indeed set to 0xff, which is wrong
for hvmloader too according to the conversation that's going on
related to this HVM_VM86_TSS_SIZE param.

> - according to what Tim said (but I didn't check myself yet) - the
> 255 used in hvmloader. I.e. if you clone hvmloader code, all aspects
> of it should match.
>
> > Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a
> > PVHv2 Dom0. IIRC I've tried that before (without unrestricted mode
> > support) and it was working fine.
>
> Now if that's the case, then why bother with the TSS?

It seems like it working was just luck, but I don't know all the
details. Maybe the emulator is somehow fixing this up when the TSS is
corrupted/incorrect?

> >> What is the !paging_mode_hap() part good for? [...]
> >
> > AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no
> > EPT, but with an IOMMU? Does that exist? [...]
>
> At least when running Xen itself virtualized, I wouldn't dare to
> assume this is an impossible combination. However, I can't see how
> that case would be handled any different by the original or the
> suggested replacement expressions: need_paging would get set either
> way afaict.

Oh yes, sorry, my reply was to the "What is the !paging_mode_hap()
part good for?" question. I've changed setting need_paging as you
suggested.

> > Given the change that you requested in pvh_steal_ram, the start of
> > the memory area returned by it is not going to be page-aligned, so
> > I will have to perform the TSS setup first, and then the identity
> > page tables.
>
> Or simply pass the required alignment.

Passing an alignment here would mean that pvh_steal_ram would have to
return 2 pages in order to meet this alignment, and we would end up
wasting memory. Also, this is the only caller of pvh_steal_ram that
requires alignment. This is what I have after changing pvh_steal_ram
to remove RAM from the end of the region:

static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
{
    p2m_type_t p2mt;
    uint32_t rc, *ident_pt;
    uint8_t *tss;
    mfn_t mfn;
    paddr_t gaddr;

    /*
     * Steal some space from the last found RAM region. One page will be
     * used for the identity page tables, and the remaining space for the
     * VM86 TSS. Note that after this not all e820 regions will be aligned
     * to PAGE_SIZE.
     */
    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
    {
        printk("Unable to find memory to stash the identity map and TSS\n");
        return -ENOMEM;
    }

    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                         &mfn, &p2mt, 0, &rc);
    if ( tss )
    {
        memset(tss, 0, HVM_VM86_TSS_SIZE);
        unmap_domain_page(tss);
        put_page(mfn_to_page(mfn_x(mfn)));
        d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
    }
    else
        printk("Unable to map VM86 TSS area\n");

    gaddr += HVM_VM86_TSS_SIZE;
    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

    /*
     * Identity-map page table is required for running with CR0.PG=0
     * when using Intel EPT. Create a 32-bit non-PAE page directory of
     * superpages.
     */
    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                              &mfn, &p2mt, 0, &rc);
    if ( ident_pt == NULL )
    {
        printk("Unable to map identity page tables\n");
        return -ENOMEM;
    }
    write_32bit_pse_identmap(ident_pt);
    unmap_domain_page(ident_pt);
    put_page(mfn_to_page(mfn_x(mfn)));
    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;

    return 0;
}

Roger.
>>> On 27.01.17 at 17:04, <roger.pau@citrix.com> wrote:
> On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
>> Well, for the purpose of this patch it would be sufficient if the
>> comment referred to hvmloader. But then I think I saw you set the
>> TSS limit to 0x67, which is neither in line with the value above nor
>
> Hm, no, I'm not setting the limit anywhere here, this is done in
> vmx_set_segment_register,

Well, you do, in patch 8 (in pvh_setup_cpus()). But that's a different
TSS, so the limits are independent. It's just what I had in mind here.

> and it's indeed set to 0xff, which is wrong for hvmloader too
> according to the conversation that's going on related to this
> HVM_VM86_TSS_SIZE param.

Right.

>> Now if that's the case, then why bother with the TSS?
>
> It seems like it working was just luck, but I don't know all the
> details. Maybe the emulator is somehow fixing this up when the TSS is
> corrupted/incorrect?

I don't think so. Btw, why is the kernel dropping back into real mode
anyway? It's being started in protected mode after all.

>> Or simply pass the required alignment.
>
> Passing an alignment here would mean that pvh_steal_ram would have to
> return 2 pages in order to meet this alignment, and we would end up
> wasting memory. Also, this is the only caller of pvh_steal_ram that
> requires alignment. This is what I have after changing pvh_steal_ram
> to remove RAM from the end of the region:
>
> static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> {
> [...]
>     if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
>     {
>         printk("Unable to find memory to stash the identity map and TSS\n");
>         return -ENOMEM;
>     }
>
>     tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>                          &mfn, &p2mt, 0, &rc);
>     if ( tss )
>     {
>         memset(tss, 0, HVM_VM86_TSS_SIZE);
>         unmap_domain_page(tss);
>         put_page(mfn_to_page(mfn_x(mfn)));
>         d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
>     }
>     else
>         printk("Unable to map VM86 TSS area\n");
>
>     gaddr += HVM_VM86_TSS_SIZE;
>     ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

And this assert holds merely because, prior to this function running,
all E820 entries are page aligned? That's rather fragile then.
Considering that getting into here is going to be increasingly
unlikely going forward, I don't think we should be afraid of wasting a
little bit of memory here.

Jan
>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>> > But looking at it now, I'm not convinced of exactly how. The magic
>> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O
>> > map base address itself lives at offset 100. A zero'd TSS should
>> > mean an I/O map at 0, and an interrupt redirection bitmap at -32,
>> > which would plausibly work if the TSS were 256 bytes (matching the
>> > limit set in Xen). Perhaps it's only working because the 128 bytes
>> > following the TSS in hvmloader happen to be zeros too?
>>
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
>
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

I don't think wrapping takes the limit value into account. It's all
linear address calculations, and as Andrew says the assumption in
microcode likely is that things will be set up properly by any OS
interested in using the interrupt bitmap.

>> > I also don't remember why the TSS is 128 rather than 104 bytes. [...]
>>
>> The documented way of expressing "no IO bitmap" is to set the map
>> base to a value which exceeds the TSS limit. All this means (I
>> think) is that you must make a larger-than-default TSS if you want
>> to use an IO or software interrupt bitmap.
>
> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
>
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

I suppose so: this is precisely enough for the ISA port range.

So what we'll need to do then, as I understand it from the discussion
so far:
 - vmx_set_segment_register() will need to set a correct limit,
 - vmx_set_segment_register() should initialize the TSS every time
   (including setting the I/O bitmap address to no lower than 32),
 - hvmloader's init_vm86_tss() will need to allocate 160 bytes rather
   than 128 (and we should expose this number, so that Roger can also
   use it).

Perhaps we should even introduce a hypercall for hvmloader to query
the needed value, rather than exposing a hardcoded number?

Jan
On 27/01/17 16:40, Jan Beulich wrote:
>>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
>> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>>> On 27/01/17 11:14, Tim Deegan wrote:
>>>> But looking at it now, I'm not convinced of exactly how. The magic
>>>> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>>>> base address itself lives at offset 100. A zero'd TSS should mean an
>>>> I/O map at 0, and an interrupt redirection bitmap at -32, which would
>>>> plausibly work if the TSS were 256 bytes (matching the limit set in
>>>> Xen). Perhaps it's only working because the 128 bytes following the
>>>> TSS in hvmloader happen to be zeros too?
>>> With an IO_base_map of 0, the software interrupt bitmap will end up
>>> being ahead of the TSS, not after it.
>> I should have thought that the segmented address calculation would
>> wrap and leave us at TSS + 224.
> I don't think wrapping takes the limit value into account. It's all
> linear address calculations, and as Andrew says the assumption in
> microcode likely is that things will be set up properly by any OS
> interested in using the interrupt bitmap.
>
>>>> I also don't remember why the TSS is 128 rather than 104 bytes. The
>>>> SDM claims that the TSS must be larger than 104 bytes "when accessing
>>>> the I/O permission bit map or interrupt redirection bit map."
>>>> (7.2.2. "TSS Descriptor") but I suspect that just means that the
>>>> generated address of the bitmap must lie inside the limit.
>>> The documented way of expressing "no IO bitmap" is to set the map base
>>> to a value which exceeds the TSS limit. All this means (I think) is
>>> that you must make a larger than default TSS if you want to use an IO
>>> or software interrupt bitmap.
>> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
>> enough space for a full one, but the current SDM is pretty clear that
>> the CPU will try to check it in virtual 8086 mode.
>>
>> It may be that all the ports actually used happen to fall in the 128
>> bytes of zeros that we provide.
> I suppose so: This is precisely enough for the ISA port range.
>
> So what we'll need to do then, as I understand it from the discussion
> so far:
>
> - vmx_set_segment_register() will need to set a correct limit
> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)
> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
>
> Perhaps we should even introduce a hypercall for hvmloader to query
> the needed value, rather than exposing a hardcoded number?

I suggest we remove all responsibility for managing this from
hvmloader. The only things hvmloader would do are allocate space for
it and reserve it in the E820.

It is conceptually related to IDENT_PT, although the IDENT_PT must be
allocated and filled in by the domain builder for the HVM guest to
function. It would be cleaner for the domain builder to also allocate
an adjacent page for the VM86_TSS when it constructs the IDENT_PT. All
HVMLoader needs to do is read the two hvmparams and adjust the E820
table suitably.

Finally, the IO bitmap needs to be a fraction larger than 160 bytes.
From tools/firmware/rombios/rombios.h:

#define PANIC_PORT  0x400
#define PANIC_PORT2 0x401
#define INFO_PORT   0x402
#define DEBUG_PORT  0x403

which are just above the ISA range.

I'd also just allocate a full page for it; no OS is going to bother
trying to use fractions of a page around an E820 reserved region.

~Andrew
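[If the domain builder did take over the allocation, hvmloader's duty
would shrink to roughly the following sketch. hvm_get_param() and
e820_add_reserved_entry() are stand-ins, not existing hvmloader
functions; the params are the existing HVM_PARAM_IDENT_PT and
HVM_PARAM_VM86_TSS:]

#include <stdint.h>

#define PAGE_SIZE 4096

/* Stand-in declarations for illustration only. */
extern uint64_t hvm_get_param(uint32_t index);
extern void e820_add_reserved_entry(uint64_t addr, uint64_t size);

static void reserve_realmode_helpers(uint32_t param_ident_pt,
                                     uint32_t param_vm86_tss)
{
    /* Read back where the domain builder placed the two structures. */
    uint64_t ident_pt = hvm_get_param(param_ident_pt);
    uint64_t vm86_tss = hvm_get_param(param_vm86_tss);

    /* Mark them reserved, assuming one page each as suggested above. */
    e820_add_reserved_entry(ident_pt, PAGE_SIZE);
    e820_add_reserved_entry(vm86_tss, PAGE_SIZE);
}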
At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
> On 27/01/17 14:01, Tim Deegan wrote:
> > Hi,
> >
> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> >> The actual behaviour can be determined by putting the TSS on a page
> >> boundary, making the previous frame non-readable via EPT, and seeing
> >> whether an EPT violation occurs.
> > Indeed. Or likewise with normal pagetables.
> >
> >>> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
> >>> enough space for a full one, but the current SDM is pretty clear that
> >>> the CPU will try to check it in virtual 8086 mode.
> >>>
> >>> It may be that all the ports actually used happen to fall in the 128
> >>> bytes of zeros that we provide.
> >> With an offset of 0, we actually provide 256 bytes of zeros in the
> >> bitmap within the TSS limit.
> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>
> That is a good point. Nothing prevents a guest exiting vm86 mode, and
> using a task switch to move to a new tss, which will cause Xen to write
> state back into the vm86_tss, making it no longer a zeroed block of
> memory.
>
> Despite being owned by the guest, this TSS is actually managed by Xen.
> It should be initialised to defaults each time Xen needs to use it on
> behalf of the guest.

But it's already in an E820 reserved block - if the guest overwrites
it (with a task switch or otherwise) it will break real-mode support,
but this is no worse than nobbling any other part of the BIOS state.

If we're making it non-zero, I can see an argument for having Xen init
the contents once (maybe when the HVM param is written?) so that it
matches what Xen expects of it. But resetting it every time we use it
would be overkill.

> >> We set IOPL to 3 as well when entering vm86 to fake up real mode.
> >> This bypasses all I/O bitmap checks (a property common to ring 3
> >> protected tasks as well - see specifically 20.2.7 "Sensitive
> >> Instructions"), which means the IN/OUT instructions end up directly at
> >> the relevant vmexit case.
> > 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> > mode all IN/OUT ops check the bitmap even with IOPL == CPL.
>
> Hmm. Right you are, which explains why the TSS limit is greater than
> 0x67.
>
> If the emulation code were working correctly, the emulator should come
> to the same conclusion as hardware and inject a #GP fault.

I don't think so -- the emulator is emulating actual real mode, not
virtual 8086 mode, so it shouldn't fault on any IO port accesses.

Cheers,

Tim.
At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
> >>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> > At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> >> On 27/01/17 11:14, Tim Deegan wrote:
> >> > But looking at it now, I'm not convinced of exactly how. The magic
> >> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> >> > base address itself lives at offset 100. A zero'd TSS should mean an
> >> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
> >> > plausibly work if the TSS were 256 bytes (matching the limit set in
> >> > Xen). Perhaps it's only working because the 128 bytes following the
> >> > TSS in hvmloader happen to be zeros too?
> >>
> >> With an IO_base_map of 0, the software interrupt bitmap will end up
> >> being ahead of the TSS, not after it.
> >
> > I should have thought that the segmented address calculation would
> > wrap and leave us at TSS + 224.
>
> I don't think wrapping takes the limit value into account.

Quite right, I'm talking nonsense.

> - vmx_set_segment_register() will need to set a correct limit

Yep.

> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)

Probably to no lower than 136, to avoid having the bits of that field
itself appearing in either the IO or interrupt bitmap.

> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
>
> Perhaps we should even introduce a hypercall for hvmloader to query
> the needed value, rather than exposing a hardcoded number?

I think Andrew's suggestion of just using a whole page is a good one.
The TSS is a 32-bit one, after all, and doesn't need to live in BIOS
space.

Cheers,

Tim.
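[The 136 falls out of the structure layout discussed in this thread;
spelled out, with an illustrative macro name:]

/*
 * The interrupt redirection bitmap starts 32 bytes below the I/O bitmap.
 * The base TSS structure occupies bytes 0-103 (the I/O map base field is
 * the 16-bit word at offset 100), so keeping both bitmaps clear of the
 * base structure requires:
 *
 *     iomap_base - 32 >= 104   =>   iomap_base >= 136
 */
#define VM86_TSS_MIN_IOMAP_BASE 136 /* hypothetical constant */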
>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:
> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>> On 27/01/17 14:01, Tim Deegan wrote:
>> > Hi,
>> >
>> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> >> The actual behaviour can be determined by putting the TSS on a page
>> >> boundary, making the previous frame non-readable via EPT, and seeing
>> >> whether an EPT violation occurs.
>> > Indeed. Or likewise with normal pagetables.
>> >
>> >>> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
>> >>> enough space for a full one, but the current SDM is pretty clear that
>> >>> the CPU will try to check it in virtual 8086 mode.
>> >>>
>> >>> It may be that all the ports actually used happen to fall in the 128
>> >>> bytes of zeros that we provide.
>> >> With an offset of 0, we actually provide 256 bytes of zeros in the
>> >> bitmap within the TSS limit.
>> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>>
>> That is a good point. Nothing prevents a guest exiting vm86 mode, and
>> using a task switch to move to a new tss, which will cause Xen to write
>> state back into the vm86_tss, making it no longer a zeroed block of
>> memory.
>>
>> Despite being owned by the guest, this TSS is actually managed by Xen.
>> It should be initialised to defaults each time Xen needs to use it on
>> behalf of the guest.
>
> But it's already in an E820 reserved block - if the guest overwrites
> it (with a task switch or otherwise) it will break real-mode support,
> but this is no worse than nobbling any other part of the BIOS state.
>
> If we're making it non-zero, I can see an argument for having Xen init
> the contents once (maybe when the HVM param is written?) so that it
> matches what Xen expects of it. But resetting it every time we use it
> would be overkill.

That wasn't the point Andrew was making, I think. A task switch
initiated by the guest would make the hypervisor write into that TSS
(as the outgoing one). Of course any sane guest would do an LTR first
(or else it would risk memory near address zero getting clobbered on
real hardware).

Jan
On 30/01/17 10:43, Jan Beulich wrote:
>>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:
>> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>>> On 27/01/17 14:01, Tim Deegan wrote:
>>>> Hi,
>>>>
>>>> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>>>>> The actual behaviour can be determined by putting the TSS on a page
>>>>> boundary, making the previous frame non-readable via EPT, and seeing
>>>>> whether an EPT violation occurs.
>>>> Indeed. Or likewise with normal pagetables.
>>>>
>>>>>> Yes, I wonder about the I/O bitmap too. We don't provide one, or even
>>>>>> enough space for a full one, but the current SDM is pretty clear that
>>>>>> the CPU will try to check it in virtual 8086 mode.
>>>>>>
>>>>>> It may be that all the ports actually used happen to fall in the 128
>>>>>> bytes of zeros that we provide.
>>>>> With an offset of 0, we actually provide 256 bytes of zeros in the
>>>>> bitmap within the TSS limit.
>>>> Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>>> That is a good point. Nothing prevents a guest exiting vm86 mode, and
>>> using a task switch to move to a new tss, which will cause Xen to write
>>> state back into the vm86_tss, making it no longer a zeroed block of
>>> memory.
>>>
>>> Despite being owned by the guest, this TSS is actually managed by Xen.
>>> It should be initialised to defaults each time Xen needs to use it on
>>> behalf of the guest.
>> But it's already in an E820 reserved block - if the guest overwrites
>> it (with a task switch or otherwise) it will break real-mode support,
>> but this is no worse than nobbling any other part of the BIOS state.
>>
>> If we're making it non-zero, I can see an argument for having Xen init
>> the contents once (maybe when the HVM param is written?) so that it
>> matches what Xen expects of it. But resetting it every time we use it
>> would be overkill.
> That wasn't the point Andrew was making, I think. A task switch
> initiated by the guest would make the hypervisor write into that TSS
> (as the outgoing one). Of course any sane guest would do an LTR first
> (or else it would risk memory near address zero getting clobbered on
> real hardware).

Thinking about it, this depends on whether we properly save and restore
the protected mode %tr around entering and exiting faked-up real mode.

If the saving and restoring is already done properly, then I think my
concern is unfounded.

~Andrew
>>> On 27.01.17 at 20:48, <tim@xen.org> wrote:
> At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
>
> Probably to no lower than 136, to avoid having the bits of that field
> itself appearing in either the IO or interrupt bitmap.

Indeed.

>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>>
>> Perhaps we should even introduce a hypercall for hvmloader to query
>> the needed value, rather than exposing a hardcoded number?
>
> I think Andrew's suggestion of just using a whole page is a good one.
> The TSS is a 32-bit one, after all, and doesn't need to live in BIOS
> space.

Hmm, by the looks of it any size increase will need to come with
further changes, including the use of a new HVM param: the VM86_TSS
param is being migrated, and hence for an incoming VM we need to be
able to tell whether the guest has set aside 128 bytes or a full page.
This of course implies that we need to keep Xen handling the 128-byte
case, too. And if we somehow expect that a single page may not suffice
in the future, it may even be advisable to store an (address,size)
pair as the param.

Jan
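[One way such a combined param could be encoded - purely a sketch, no
such ABI exists at this point in the discussion - is to stash the size
in the otherwise unused top bits of the 64-bit value:]

#include <stdint.h>

#define VM86_TSS_SIZE_SHIFT 48 /* assumption: size in the top 16 bits */

/* addr is assumed to fit below bit 48; a real ABI would enforce this. */
static inline uint64_t vm86_tss_pack(uint64_t addr, uint16_t size)
{
    return addr | ((uint64_t)size << VM86_TSS_SIZE_SHIFT);
}

static inline uint64_t vm86_tss_addr(uint64_t val)
{
    return val & ((1ULL << VM86_TSS_SIZE_SHIFT) - 1);
}

static inline uint16_t vm86_tss_size(uint64_t val)
{
    return val >> VM86_TSS_SIZE_SHIFT;
}

[An incoming VM migrated with only the old param would then be handled
by treating a zero size field as the legacy 128-byte layout.]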
>>> On 27.01.17 at 19:06, <andrew.cooper3@citrix.com> wrote:
> On 27/01/17 16:40, Jan Beulich wrote:
>> So what we'll need to do then, as I understand it from the discussion
>> so far:
>>
>> - vmx_set_segment_register() will need to set a correct limit
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>>
>> Perhaps we should even introduce a hypercall for hvmloader to query
>> the needed value, rather than exposing a hardcoded number?
>
> I suggest we remove all responsibility for managing this from
> hvmloader. The only things hvmloader would do are allocate space for
> it and reserve it in the E820.

While I did it that way for now, I'm no longer convinced this is
useful. With multiple vCPU-s, a guest can do whatever it wants to this
TSS anyway, regardless of whether Xen currently thinks it's using a
suitably initialized memory block. And whatever the guest does, any
non-zero bit in that area will only slow it down (due to the VM exits
resulting from the #GP faults caused by those 1 bits, resulting in the
respective I/O or INTnn insns being carried out by the emulator).

> It is conceptually related to IDENT_PT, although the IDENT_PT must be
> allocated and filled in by the domain builder for the HVM guest to
> function. It would be cleaner for the domain builder to also allocate
> an adjacent page for the VM86_TSS when it constructs the IDENT_PT.

I'll leave that for someone else to carry out; for now allocation will
remain in hvmloader.

> Finally, the IO bitmap needs to be a fraction larger than 160 bytes.
>
> From tools/firmware/rombios/rombios.h:
>
> #define PANIC_PORT  0x400
> #define PANIC_PORT2 0x401
> #define INFO_PORT   0x402
> #define DEBUG_PORT  0x403
>
> which are just above the ISA range.

Which causes only slowness (due to needing the emulator to carry out
the instruction), but no lack of functionality.

> I'd also just allocate a full page for it; no OS is going to bother
> trying to use fractions of a page around an E820 reserved region.

But the smaller range may well be part of an already partially used
page. Together with the fact that any port accesses not covered by the
bitmap would still be correctly handled, I'd prefer to make the TSS
0x68 + 0x20 + 0x80 + 1 bytes large (base structure plus interrupt
redirection bitmap plus I/O bitmap plus trailing byte), which, due to
the goal of avoiding page boundaries in the middle, would mean a
512-byte block aligned to a 512-byte boundary.

Jan
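[Spelling out that arithmetic with illustrative names; the individual
sizes follow the 32-bit TSS layout as discussed in this thread:]

#define TSS_BASE_SIZE   0x68 /* 104-byte base 32-bit TSS structure      */
#define INTR_REDIR_SIZE 0x20 /*  32 bytes = 256 bits, one per INTn      */
#define IO_BITMAP_SIZE  0x80 /* 128 bytes = 1024 bits, ports 0-0x3ff    */
#define TRAILING_BYTES  1    /* terminating byte after the I/O bitmap   */

#define VM86_TSS_BYTES  (TSS_BASE_SIZE + INTR_REDIR_SIZE + \
                         IO_BITMAP_SIZE + TRAILING_BYTES) /* 0x109 (265) */

/* Rounded so the block never straddles a page boundary: */
#define VM86_TSS_ALLOC  512 /* 512-byte block at a 512-byte boundary */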
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 4d555b1..fbce1c2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -22,6 +22,7 @@
 #include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
+#include <xen/guest_access.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
+/* Size of the VM86 TSS for virtual 8086 mode to use. */
+#define HVM_VM86_TSS_SIZE 128
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  *
@@ -244,11 +248,12 @@ boolean_param("ro-hpet", ro_hpet);
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
 #define round_pgdown(_p)  ((_p)&PAGE_MASK)
 
+static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
+
 static struct page_info * __init alloc_chunk(
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
         avail -= max_pdx >> s;
     }
 
-    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
+    need_paging = opt_dom0_shadow ||
+                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
+                                                   !paging_mode_hap(d)));
     for ( ; ; need_paging = 0 )
     {
         nr_pages = dom0_nrpages;
@@ -365,7 +372,8 @@ static unsigned long __init compute_dom0_nr_pages(
             avail -= dom0_paging_pages(d, nr_pages);
     }
 
-    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
+    if ( is_pv_domain(d) &&
+         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
          ((dom0_min_nrpages <= 0) || (nr_pages > min_pages)) )
     {
         /*
@@ -581,6 +589,7 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
     struct e820entry *entry, *entry_guest;
     unsigned int i;
     unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
 
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
@@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
             continue;
         }
 
-        *entry_guest = *entry;
-        pages = PFN_UP(entry_guest->size);
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation. Since
+         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
+         * order to prevent this code from getting out of sync.
+         */
+        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
+        end = (entry->addr + entry->size) &
+              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1);
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
         if ( (cur_pages + pages) > nr_pages )
         {
             /* Truncate region */
@@ -1680,15 +1703,281 @@ out:
     return rc;
 }
 
+static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
+                                       unsigned long nr_pages, bool map)
+{
+    int rc;
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            break;
+        }
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
+/* Populate an HVM memory range using the biggest possible order. */
+static int __init pvh_populate_memory_range(struct domain *d,
+                                            unsigned long start,
+                                            unsigned long nr_pages)
+{
+    unsigned int order, i = 0;
+    struct page_info *page;
+    int rc;
+#define MAP_MAX_ITER 64
+
+    order = MAX_ORDER;
+    while ( nr_pages != 0 )
+    {
+        unsigned int range_order = get_order_from_pages(nr_pages + 1);
+
+        order = min(range_order ? range_order - 1 : 0, order);
+        page = alloc_domheap_pages(d, order, memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && memflags )
+            {
+                /* Try again without any memflags. */
+                memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+            {
+                printk("Unable to allocate memory with order 0!\n");
+                return -ENOMEM;
+            }
+            order--;
+            continue;
+        }
+
+        rc = guest_physmap_add_page(d, _gfn(start), _mfn(page_to_mfn(page)),
+                                    order);
+        if ( rc != 0 )
+        {
+            printk("Failed to populate memory: [%#lx,%lx): %d\n",
+                   start, start + (1UL << order), rc);
+            return -ENOMEM;
+        }
+        start += 1UL << order;
+        nr_pages -= 1UL << order;
+        if ( (++i % MAP_MAX_ITER) == 0 )
+            process_pending_softirqs();
+    }
+
+    return 0;
+#undef MAP_MAX_ITER
+}
+
+static int __init pvh_steal_ram(struct domain *d, unsigned long size,
+                                paddr_t limit, paddr_t *addr)
+{
+    unsigned int i = d->arch.nr_e820;
+
+    while ( i-- )
+    {
+        struct e820entry *entry = &d->arch.e820[i];
+
+        if ( entry->type != E820_RAM || entry->size < size )
+            continue;
+
+        /* Subtract from the beginning. */
+        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
+        {
+            *addr = entry->addr;
+            entry->addr += size;
+            entry->size -= size;
+            return 0;
+        }
+    }
+
+    return -ENOMEM;
+}
+
+static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
+{
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    uint8_t *tss;
+    mfn_t mfn;
+    paddr_t gaddr;
+    unsigned int i;
+
+    /*
+     * Steal some space from the last found RAM region. One page will be
+     * used for the identity page tables, and the remaining space for the
+     * VM86 TSS. Note that after this not all e820 regions will be aligned
+     * to PAGE_SIZE.
+     */
+    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
+    {
+        printk("Unable to find memory to stash the identity map and TSS\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
+        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
+                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+    gaddr += PAGE_SIZE;
+    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
+
+    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                         &mfn, &p2mt, 0, &rc);
+    if ( tss == NULL )
+    {
+        printk("Unable to map VM86 TSS area\n");
+        return 0;
+    }
+
+    memset(tss, 0, HVM_VM86_TSS_SIZE);
+    unmap_domain_page(tss);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
+
+    return 0;
+}
+
+static void __init pvh_steal_low_ram(struct domain *d, unsigned long start,
+                                     unsigned long nr_pages)
+{
+    unsigned long mfn;
+
+    ASSERT(start + nr_pages <= PFN_DOWN(MB(1)));
+
+    for ( mfn = start; mfn < start + nr_pages; mfn++ )
+    {
+        struct page_info *pg = mfn_to_page(mfn);
+        int rc;
+
+        rc = unshare_xen_page_with_guest(pg, dom_io);
+        if ( rc )
+        {
+            printk("Unable to unshare Xen mfn %#lx: %d\n", mfn, rc);
+            continue;
+        }
+
+        share_xen_page_with_guest(pg, d, XENSHARE_writable);
+        rc = guest_physmap_add_entry(d, _gfn(mfn), _mfn(mfn), 0, p2m_ram_rw);
+        if ( rc )
+            printk("Unable to add mfn %#lx to p2m: %d\n", mfn, rc);
+    }
+}
+
+static int __init pvh_setup_p2m(struct domain *d)
+{
+    struct vcpu *v = d->vcpu[0];
+    unsigned long nr_pages;
+    unsigned int i;
+    int rc;
+    bool preempted;
+#define MB1_PAGES PFN_DOWN(MB(1))
+
+    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
+
+    pvh_setup_e820(d, nr_pages);
+    do {
+        preempted = false;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Memory below 1MB is identity mapped.
+     * NB: this only makes sense when booted from legacy BIOS.
+     */
+    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
+    if ( rc )
+    {
+        printk("Failed to identity map low 1MB: %d\n", rc);
+        return rc;
+    }
+
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        unsigned long addr, size;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        addr = PFN_DOWN(d->arch.e820[i].addr);
+        size = PFN_DOWN(d->arch.e820[i].size);

+        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
+
+        if ( addr >= MB1_PAGES )
+            rc = pvh_populate_memory_range(d, addr, size);
+        else
+            pvh_steal_low_ram(d, addr, size);
+
+        if ( rc )
+            return rc;
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = pvh_setup_vmx_realmode_helpers(d);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+#undef MB1_PAGES
+}
+
 static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    int rc;
 
     printk("** Building a PVH Dom0 **\n");
 
+    iommu_hwdom_init(d);
+
+    rc = pvh_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
     return 0;
 }
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a5521f1..721a587 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -475,6 +475,22 @@ void share_xen_page_with_guest(
     spin_unlock(&d->page_alloc_lock);
 }
 
+int __init unshare_xen_page_with_guest(struct page_info *page,
+                                       struct domain *d)
+{
+    if ( page_get_owner(page) != d || !is_xen_heap_page(page) )
+        return -EINVAL;
+
+    if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
+        put_page(page);
+
+    /* Remove the owner and clear the flags. */
+    page->u.inuse.type_info = 0;
+    page_set_owner(page, NULL);
+
+    return 0;
+}
+
 void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly)
 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 93a073d..3d02ebb 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -276,6 +276,8 @@ struct spage_info
 #define XENSHARE_readonly 1
 extern void share_xen_page_with_guest(
     struct page_info *page, struct domain *d, int readonly);
+extern int unshare_xen_page_with_guest(struct page_info *page,
+                                       struct domain *d);
 extern void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly);
 extern void free_shared_domheap_page(struct page_info *page);
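[As a standalone illustration of what the identity-map loop in
pvh_setup_vmx_realmode_helpers() produces: each of the 1024 directory
entries maps a 4MiB superpage onto itself, so the guest can run with
CR0.PG=0 under EPT. The helper name is hypothetical; the flag values
are the architectural x86 non-PAE page-table bits:]

#include <stdint.h>

#define PAGE_SIZE      4096
#define _PAGE_PRESENT  0x001
#define _PAGE_RW       0x002
#define _PAGE_USER     0x004
#define _PAGE_ACCESSED 0x020
#define _PAGE_DIRTY    0x040
#define _PAGE_PSE      0x080

/* Fill a page-sized array with a non-PAE identity page directory of
 * 4MiB superpages: entry i maps linear address i << 22 to itself. */
static void fill_pse_identmap(uint32_t l2[PAGE_SIZE / sizeof(uint32_t)])
{
    unsigned int i;

    for ( i = 0; i < PAGE_SIZE / sizeof(uint32_t); i++ )
        l2[i] = (i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
                _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE;
}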
Craft the Dom0 e820 memory map and populate it. Introduce a helper to
remove memory pages that are shared between Xen and a domain, and use
it to remove the low 1MB RAM regions from dom_io so that they can be
assigned to a PVHv2 Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Move process_pending_softirqs to previous patch.
 - Fix off-by-one errors in some checks.
 - Make unshare_xen_page_with_guest __init.
 - Improve unshare_xen_page_with_guest by making use of the already
   existing is_xen_heap_page and put_page.
 - s/hvm/pvh/.
 - Use PAGE_ORDER_4K in pvh_setup_e820 in order to keep consistency
   with the p2m code.

Changes since v3:
 - Drop get_order_from_bytes_floor; it was only used by
   hvm_populate_memory_range.
 - Switch hvm_populate_memory_range to use frame numbers instead of
   full memory addresses.
 - Add a helper to steal the low 1MB RAM areas from dom_io and add
   them to Dom0 as normal RAM.
 - Introduce unshare_xen_page_with_guest in order to remove pages from
   dom_io, so they can be assigned to other domains. This is needed in
   order to remove the low 1MB RAM regions from dom_io and assign them
   to the hardware_domain.
 - Simplify the loop in hvm_steal_ram.
 - Move definition of map_identity_mmio into this patch.

Changes since v2:
 - Introduce get_order_from_bytes_floor as a local function to
   domain_build.c.
 - Remove extra asserts.
 - Make hvm_populate_memory_range return an error code instead of
   panicking.
 - Fix comments and printks.
 - Use ULL suffix instead of casting to uint64_t.
 - Rename hvm_setup_vmx_unrestricted_guest to
   hvm_setup_vmx_realmode_helpers.
 - Only subtract two pages from the memory calculation, which will be
   used by the MADT replacement.
 - Remove some comments.
 - Remove printing allocation information.
 - Don't stash any pages for the MADT, TSS or ident PT; those will be
   subtracted directly from RAM regions of the memory map.
 - Count the number of iterations before calling
   process_pending_softirqs when populating the memory map.
 - Move the initial call to process_pending_softirqs into
   construct_dom0, and remove the ones from construct_dom0_hvm and
   construct_dom0_pv.
 - Make memflags global so it can be shared between alloc_chunk and
   hvm_populate_memory_range.

Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pB specifier in order to print sizes in human readable
   form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted
   mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures into
   a separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
---
 xen/arch/x86/domain_build.c | 299 +++++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/mm.c           |  16 +++
 xen/include/asm-x86/mm.h    |   2 +
 3 files changed, 312 insertions(+), 5 deletions(-)