[v5,4/9] xen/x86: populate PVHv2 Dom0 physical memory map

Message ID	20170119172941.65642-5-roger.pau@citrix.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> From: Roger Pau Monne <roger.pau@citrix.com> To: <xen-devel@lists.xenproject.org> Date: Thu, 19 Jan 2017 17:29:36 +0000 Message-ID: <20170119172941.65642-5-roger.pau@citrix.com> In-Reply-To: <20170119172941.65642-1-roger.pau@citrix.com> References: <20170119172941.65642-1-roger.pau@citrix.com> MIME-Version: 1.0 Cc: Andrew Cooper <andrew.cooper3@citrix.com>, boris.ostrovsky@oracle.com, Roger Pau Monne <roger.pau@citrix.com>, Jan Beulich <jbeulich@suse.com>, konrad.wilk@oracle.com Subject: [Xen-devel] [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Roger Pau Monne Jan. 19, 2017, 5:29 p.m. UTC

Craft the Dom0 e820 memory map and populate it. Introduce a helper to remove
memory pages that are shared between Xen and a domain, and use it in order to
remove low 1MB RAM regions from dom_io in order to assign them to a PVHv2 Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Move process_pending_softirqs to previous patch.
 - Fix off-by-one errors in some checks.
 - Make unshare_xen_page_with_guest __init.
 - Improve unshare_xen_page_with_guest by making use of already existing
   is_xen_heap_page and put_page.
 - s/hvm/pvh/.
 - Use PAGE_ORDER_4K in pvh_setup_e820 in order to keep consistency with the
   p2m code.

Changes since v3:
 - Drop get_order_from_bytes_floor, it was only used by
   hvm_populate_memory_range.
 - Switch hvm_populate_memory_range to use frame numbers instead of full memory
   addresses.
 - Add a helper to steal the low 1MB RAM areas from dom_io and add them to Dom0
   as normal RAM.
 - Introduce unshare_xen_page_with_guest in order to remove pages from dom_io,
   so they can be assigned to other domains. This is needed in order to remove
   the low 1MB RAM regions from dom_io and assign them to the hardware_domain.
 - Simplify the loop in hvm_steal_ram.
 - Move definition of map_identity_mmio into this patch.

Changes since v2:
 - Introduce get_order_from_bytes_floor as a local function to
   domain_build.c.
 - Remove extra asserts.
 - Make hvm_populate_memory_range return an error code instead of panicking.
 - Fix comments and printks.
 - Use ULL sufix instead of casting to uint64_t.
 - Rename hvm_setup_vmx_unrestricted_guest to
   hvm_setup_vmx_realmode_helpers.
 - Only substract two pages from the memory calculation, that will be used
   by the MADT replacement.
 - Remove some comments.
 - Remove printing allocation information.
 - Don't stash any pages for the MADT, TSS or ident PT, those will be
   subtracted directly from RAM regions of the memory map.
 - Count the number of iterations before calling process_pending_softirqs
   when populating the memory map.
 - Move the initial call to process_pending_softirqs into construct_dom0,
   and remove the ones from construct_dom0_hvm and construct_dom0_pv.
 - Make memflags global so it can be shared between alloc_chunk and
   hvm_populate_memory_range.

Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pB specifier in order to print sizes in human readable form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures to a
   separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
---
 xen/arch/x86/domain_build.c | 299 +++++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/mm.c           |  16 +++
 xen/include/asm-x86/mm.h    |   2 +
 3 files changed, 312 insertions(+), 5 deletions(-)

Andrew Cooper Jan. 20, 2017, 7:41 p.m. UTC | #1

On 19/01/17 17:29, Roger Pau Monne wrote:
> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);

Please can you make helper for this and dedup it with shadow_enable(). 
Something like:

void write_pse_identmap(uint32_t *l2)

rather than duplicating this particular piece of magic.  (It can
probably even be static inline.)

> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
> +
> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);

Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
pages during construction?  (I can't actually recall.  Perhaps it is
better to explicitly clear it.)

~Andrew

Jan Beulich Jan. 23, 2017, 11:23 a.m. UTC | #2

>>> On 20.01.17 at 20:41, <andrew.cooper3@citrix.com> wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
>> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>> +                         &mfn, &p2mt, 0, &rc);
>> +    if ( tss == NULL )
>> +    {
>> +        printk("Unable to map VM86 TSS area\n");
>> +        return 0;
>> +    }
>> +
>> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> 
> Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
> pages during construction?  (I can't actually recall.  Perhaps it is
> better to explicitly clear it.)

No, we don't zero before handing out, we zero after a reclaiming
memory from a dying guest or from the hypervisor.

Jan

Boris Ostrovsky Jan. 23, 2017, 2:11 p.m. UTC | #3

>  
> +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> +                                       unsigned long nr_pages, bool map)
> +{
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)

This can be taken outside the loop.

-boris

> +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> +            break;
> +        }
> +        nr_pages -= rc;
> +        pfn += rc;
> +        process_pending_softirqs();
> +    }
> +
> +    return rc;
> +}
> +

Roger Pau Monne Jan. 23, 2017, 2:43 p.m. UTC | #4

On Mon, Jan 23, 2017 at 09:11:06AM -0500, Boris Ostrovsky wrote:
> 
> >  
> > +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> > +                                       unsigned long nr_pages, bool map)
> > +{
> > +    int rc;
> > +
> > +    for ( ; ; )
> > +    {
> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> 
> This can be taken outside the loop.

Maybe I can instead make map const, and the compiler should optimize this
itself?

I find it a little cumbersome to store function pointers, ie:

int (*mapf)(struct domain *, gfn_t, unsigned long, mfn_t) = ...;

Roger.

Jan Beulich Jan. 26, 2017, 12:41 p.m. UTC | #5

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
>  static long __initdata dom0_max_nrpages = LONG_MAX;
>  
> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> +#define HVM_VM86_TSS_SIZE   128

I continue to be puzzled by this value. Why 128? I think this really
needs to be clarified in the comment.

> @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
>              avail -= max_pdx >> s;
>      }
>  
> -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> +    need_paging = opt_dom0_shadow ||
> +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> +                                                   !paging_mode_hap(d)));

What is the !paging_mode_hap() part good for? It's being taken care
of by checking opt_dom0_shadow already, isn't it? Alternatively, to
make the distinction more obvious, I'd suggest

    need_paging = has_hvm_container_domain(d)
                  ? !iommu_hap_pt_share || !paging_mode_hap(d)
                  : opt_dom0_shadow;

> @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>              continue;
>          }
>  
> -        *entry_guest = *entry;
> -        pages = PFN_UP(entry_guest->size);
> +        /*
> +         * Make sure the start and length are aligned to PAGE_SIZE, because
> +         * that's the minimum granularity of the 2nd stage translation. Since
> +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
> +         * order to prevent this code from getting out of sync.
> +         */
> +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);

You definitely don't need to use _AC() in C code. But the whole thing
can anyway simply be

        start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);

(albeit I'd like to note that if anything we'd have to be prepared
for page sizes > 4k, not smaller ones, and the whole idea of
PAGE_ORDER_4K breaks in that case).

> +        end = (entry->addr + entry->size) &
> +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );

On top of the above, please remove the stray blank from near
the end of this statement.

> +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> +                                paddr_t limit, paddr_t *addr)
> +{
> +    unsigned int i = d->arch.nr_e820;
> +
> +    while ( i-- )
> +    {
> +        struct e820entry *entry = &d->arch.e820[i];
> +
> +        if ( entry->type != E820_RAM || entry->size < size )
> +            continue;
> +
> +        /* Subtract from the beginning. */
> +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> +        {
> +            *addr = entry->addr;
> +            entry->addr += size;
> +            entry->size -= size;

The comment says so, but why from the beginning? Wouldn't it be
better to steal from the end of the highest range below 4Gb, to
keep an overall more conventional layout?

> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) 
> )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

This comes too late - the page table setup above also requires
page alignment (and with that, adding PAGE_SIZE would not break
the alignment requirement). Even more, the code below doesn't
strictly require page alignment, it only requires for the range to
not cross a page boundary.

> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> +    unmap_domain_page(tss);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
> +
> +    return 0;

While I've seen the code a number of times by now, I still can't
help disliking the early success return (accompanied by an error
message). I think this not being a mistake would be more obvious
with

    if ( tss )
    {
    }
    else
        printk();
    return 0;

> +static int __init pvh_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    unsigned int i;
> +    int rc;
> +    bool preempted;
> +#define MB1_PAGES PFN_DOWN(MB(1))
> +
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> +
> +    pvh_setup_e820(d, nr_pages);
> +    do {
> +        preempted = false;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Memory below 1MB is identity mapped.
> +     * NB: this only makes sense when booted from legacy BIOS.
> +     */
> +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);

MB1_PAGES

> +    if ( rc )
> +    {
> +        printk("Failed to identity map low 1MB: %d\n", rc);
> +        return rc;
> +    }
> +
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        unsigned long addr, size;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        addr = PFN_DOWN(d->arch.e820[i].addr);
> +        size = PFN_DOWN(d->arch.e820[i].size);
> +
> +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> +
> +        if ( addr >= MB1_PAGES )
> +            rc = pvh_populate_memory_range(d, addr, size);
> +        else
> +            pvh_steal_low_ram(d, addr, size);

Would you mind shortening the ASSERT() expression above by
moving it into the else branch here?

Jan

Tim Deegan Jan. 27, 2017, 11:14 a.m. UTC | #6

At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE   128
> 
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

I was asked on IRC to do some archaeology / explain myself about this,
so here goes.

First, the _intended_ mechanism for "real mode" guests on older VMX
hardware is to run them in virtual 8086 mode inside the guest as much
as possible, and emulate whenever we can't do that.

This is managed with some state in v->arch.hvm_vmx:
 - vmx_realmode, set when the guest thinks it's in real mode. 
 - vmx_emulate, to force emulation rather than VMENTER
   We set this when we have exceptions to inject, as the VMX hardware
   would try to inject them in 32-bit protected mode.
 - vm86_segment_mask, a bitmask of segments that can't be fudged
   to run in virtual 8086 mode.

When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
the guest in virtual 8086 mode.

The reason we need a TSS at all is for handling software interrupts.
Virtual 8086 mode has two ways to handle software interrupts: stay in
virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
protected mode.  We want the first of those, so that a guest in 'real mode'
can make BIOS calls.

The CPU uses a bitmap in the TSS to decide which method to use; we
need all the bits in that bitmap to be clear.  In my SDM (April 2016)
this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
Virtual-8086 Mode", table 20-2, method 5.

---

So far so good, and AIUI the system works -- or at least it did in
December 2008 when it was put in (8d4638d1), because emulating every
instruction made Windows boot times so slow that we would definitely
have noticed.

But looking at it now, I'm not convinced of exactly how.  The magic
bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
base address itself lives at offset 100.  A zero'd TSS should mean an
I/O map at 0, and an interrupt redirection bitmap at -32, which would
plausibly work if the TSS were 256 bytes (matching the limit set in
Xen).  Perhaps it's only working because the 128 bytes following the
TSS in hvmloader happen to be zeros too?

I also don't remember why the TSS is 128 rather than 104 bytes.  The
SDM claims that the TSS must be larger than 104 bytes "when accessing
the I/O permission bit map or interrupt redirection bit map."
(7.2.2. "TSS Descriptor") but I suspect that just means that the
generated address of the bitmap must lie inside the limit.

In any case, the limit set in vmx_set_segment_register() should surely
match the size of the actual TSS!

I haven't got the time or hardware to test this right now, but could
maybe look at it next week unless anyone else wants to play with it.

Cheers,

Tim.

Roger Pau Monne Jan. 27, 2017, 12:23 p.m. UTC | #7

On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >  
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE   128
> 
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

Given the recent comments by Tim, and that this is starting to look like a can
of worms, I would like to leave this as-is for the moment, on the grounds that
it's what hvmloader does (I'm not saying it's right), and that this issue
should be treated independently from this patch series.

Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
IIRC I've tried that before (without unrestricted mode support) and it was
working fine.

> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
> >              avail -= max_pdx >> s;
> >      }
> >  
> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> > +    need_paging = opt_dom0_shadow ||
> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> > +                                                   !paging_mode_hap(d)));
> 
> What is the !paging_mode_hap() part good for? It's being taken care
> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
> make the distinction more obvious, I'd suggest
> 
>     need_paging = has_hvm_container_domain(d)
>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>                   : opt_dom0_shadow;

AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
paging_mode_hap would be false. Maybe that's just an impossible combination in
any case...

> > @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >              continue;
> >          }
> >  
> > -        *entry_guest = *entry;
> > -        pages = PFN_UP(entry_guest->size);
> > +        /*
> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
> > +         * that's the minimum granularity of the 2nd stage translation. Since
> > +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
> > +         * order to prevent this code from getting out of sync.
> > +         */
> > +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
> 
> You definitely don't need to use _AC() in C code. But the whole thing
> can anyway simply be
> 
>         start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
> 
> (albeit I'd like to note that if anything we'd have to be prepared
> for page sizes > 4k, not smaller ones, and the whole idea of
> PAGE_ORDER_4K breaks in that case).

Thanks, I will change as per your recommendation above, although I'm not sure
what to do with the PAGE_ORDER_4K thing. Are you fine with leaving it like you
suggest?

> > +        end = (entry->addr + entry->size) &
> > +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );
> 
> On top of the above, please remove the stray blank from near
> the end of this statement.

I've changed that to:

        end = (entry->addr + entry->size) &
              ~((PAGE_SIZE << PAGE_ORDER_4K) - 1);

In order to match with the above.

> > +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> > +                                paddr_t limit, paddr_t *addr)
> > +{
> > +    unsigned int i = d->arch.nr_e820;
> > +
> > +    while ( i-- )
> > +    {
> > +        struct e820entry *entry = &d->arch.e820[i];
> > +
> > +        if ( entry->type != E820_RAM || entry->size < size )
> > +            continue;
> > +
> > +        /* Subtract from the beginning. */
> > +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> > +        {
> > +            *addr = entry->addr;
> > +            entry->addr += size;
> > +            entry->size -= size;
> 
> The comment says so, but why from the beginning? Wouldn't it be
> better to steal from the end of the highest range below 4Gb, to
> keep an overall more conventional layout?

That sounds sensible, let me change it to:

        /* Subtract from the end. */
        if ( entry->addr + entry->size + size <= limit &&
             entry->addr >= MB(1) )
        {
            entry->size -= size;
            *addr = entry->addr + entry->size;
            return 0;
        }

This is going to involve some changes in pvh_setup_vmx_realmode_helpers, see
below.

> > +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> > +{
> > +    p2m_type_t p2mt;
> > +    uint32_t rc, *ident_pt;
> > +    uint8_t *tss;
> > +    mfn_t mfn;
> > +    paddr_t gaddr;
> > +    unsigned int i;
> > +
> > +    /*
> > +     * Steal some space from the last found RAM region. One page will be
> > +     * used for the identity page tables, and the remaining space for the
> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> > +     * to PAGE_SIZE.
> > +     */
> > +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) 
> > )
> > +    {
> > +        printk("Unable to find memory to stash the identity map and TSS\n");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    /*
> > +     * Identity-map page table is required for running with CR0.PG=0
> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> > +     * superpages.
> > +     */
> > +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                              &mfn, &p2mt, 0, &rc);
> > +    if ( ident_pt == NULL )
> > +    {
> > +        printk("Unable to map identity page tables\n");
> > +        return -ENOMEM;
> > +    }
> > +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> > +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> > +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> > +    unmap_domain_page(ident_pt);
> > +    put_page(mfn_to_page(mfn_x(mfn)));
> > +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> > +    gaddr += PAGE_SIZE;
> > +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
> 
> This comes too late - the page table setup above also requires
> page alignment (and with that, adding PAGE_SIZE would not break
> the alignment requirement). Even more, the code below doesn't
> strictly require page alignment, it only requires for the range to
> not cross a page boundary.

Given the change that you requested in pvh_steal_ram, now the start of the
memory area returned by it it's not going to be page-aligned, so I will have to
perform the TSS setup first, and then the identity page tables.

> > +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                         &mfn, &p2mt, 0, &rc);
> > +    if ( tss == NULL )
> > +    {
> > +        printk("Unable to map VM86 TSS area\n");
> > +        return 0;
> > +    }
> > +
> > +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> > +    unmap_domain_page(tss);
> > +    put_page(mfn_to_page(mfn_x(mfn)));
> > +    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
> > +
> > +    return 0;
> 
> While I've seen the code a number of times by now, I still can't
> help disliking the early success return (accompanied by an error
> message). I think this not being a mistake would be more obvious
> with
> 
>     if ( tss )
>     {
>     }
>     else
>         printk();
>     return 0;

That's not a problem, I will change it given that I will also have to move this
before the setup of the identity page tables.

> > +static int __init pvh_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    unsigned int i;
> > +    int rc;
> > +    bool preempted;
> > +#define MB1_PAGES PFN_DOWN(MB(1))
> > +
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> > +
> > +    pvh_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = false;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Memory below 1MB is identity mapped.
> > +     * NB: this only makes sense when booted from legacy BIOS.
> > +     */
> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> 
> MB1_PAGES
> 
> > +    if ( rc )
> > +    {
> > +        printk("Failed to identity map low 1MB: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        unsigned long addr, size;
> > +
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        addr = PFN_DOWN(d->arch.e820[i].addr);
> > +        size = PFN_DOWN(d->arch.e820[i].size);
> > +
> > +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> > +
> > +        if ( addr >= MB1_PAGES )
> > +            rc = pvh_populate_memory_range(d, addr, size);
> > +        else
> > +            pvh_steal_low_ram(d, addr, size);
> 
> Would you mind shortening the ASSERT() expression above by
> moving it into the else branch here?

Fixed both of the above, thanks.

Roger.

Roger Pau Monne Jan. 27, 2017, 12:37 p.m. UTC | #8

On Fri, Jan 27, 2017 at 11:14:10AM +0000, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > > +#define HVM_VM86_TSS_SIZE   128
> > 
> > I continue to be puzzled by this value. Why 128? I think this really
> > needs to be clarified in the comment.
> 
> I was asked on IRC to do some archaeology / explain myself about this,
> so here goes.
> 
> First, the _intended_ mechanism for "real mode" guests on older VMX
> hardware is to run them in virtual 8086 mode inside the guest as much
> as possible, and emulate whenever we can't do that.
> 
> This is managed with some state in v->arch.hvm_vmx:
>  - vmx_realmode, set when the guest thinks it's in real mode. 
>  - vmx_emulate, to force emulation rather than VMENTER
>    We set this when we have exceptions to inject, as the VMX hardware
>    would try to inject them in 32-bit protected mode.
>  - vm86_segment_mask, a bitmask of segments that can't be fudged
>    to run in virtual 8086 mode.
> 
> When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
> emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
> otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
> the guest in virtual 8086 mode.
> 
> The reason we need a TSS at all is for handling software interrupts.
> Virtual 8086 mode has two ways to handle software interrupts: stay in
> virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
> protected mode.  We want the first of those, so that a guest in 'real mode'
> can make BIOS calls.
> 
> The CPU uses a bitmap in the TSS to decide which method to use; we
> need all the bits in that bitmap to be clear.  In my SDM (April 2016)
> this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
> Virtual-8086 Mode", table 20-2, method 5.
> 
> ---
> 
> So far so good, and AIUI the system works -- or at least it did in
> December 2008 when it was put in (8d4638d1), because emulating every
> instruction made Windows boot times so slow that we would definitely
> have noticed.
> 
> But looking at it now, I'm not convinced of exactly how.  The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> base address itself lives at offset 100.  A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen).  Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

Right, so *if* this was working as intended, the interrupt bitmap would be at
HVM_PARAM_VM86_TSS - 32, which we don't guarantee to zero at all.

I've also looked at the manual, and it states that the last bit of the IO
bitmap should be filled with 1s[0], which we don't do at all. Also, what's the
expected size of the IO bitmap, 64KB?

Roger.

[0] Vol3, section 20.3.3 "Class 3-Software Interrupt Handling in Virtual-8086
Mode, Figure 20-5.

Andrew Cooper Jan. 27, 2017, 12:51 p.m. UTC | #9

On 27/01/17 11:14, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
>>>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>>> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>>> +#define HVM_VM86_TSS_SIZE   128
>> I continue to be puzzled by this value. Why 128? I think this really
>> needs to be clarified in the comment.
> I was asked on IRC to do some archaeology / explain myself about this,
> so here goes.
>
> First, the _intended_ mechanism for "real mode" guests on older VMX
> hardware is to run them in virtual 8086 mode inside the guest as much
> as possible, and emulate whenever we can't do that.
>
> This is managed with some state in v->arch.hvm_vmx:
>  - vmx_realmode, set when the guest thinks it's in real mode. 
>  - vmx_emulate, to force emulation rather than VMENTER
>    We set this when we have exceptions to inject, as the VMX hardware
>    would try to inject them in 32-bit protected mode.
>  - vm86_segment_mask, a bitmask of segments that can't be fudged
>    to run in virtual 8086 mode.
>
> When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
> emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
> otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
> the guest in virtual 8086 mode.

Ah - this is where I went wrong.  I'd logically combined
vmx_enter_realmode and vmx_realmode when reading the assembly.

>
> The reason we need a TSS at all is for handling software interrupts.
> Virtual 8086 mode has two ways to handle software interrupts: stay in
> virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
> protected mode.  We want the first of those, so that a guest in 'real mode'
> can make BIOS calls.
>
> The CPU uses a bitmap in the TSS to decide which method to use; we
> need all the bits in that bitmap to be clear.  In my SDM (April 2016)
> this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
> Virtual-8086 Mode", table 20-2, method 5.
>
> ---
>
> So far so good, and AIUI the system works -- or at least it did in
> December 2008 when it was put in (8d4638d1), because emulating every
> instruction made Windows boot times so slow that we would definitely
> have noticed.
>
> But looking at it now, I'm not convinced of exactly how.  The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> base address itself lives at offset 100.  A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen).  Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

With an IO_base_map of 0, the software interrupt bitmap will end up
being ahead of the TSS, not after it.

I would not be surprised if this turns out that microcode doesn't range
check against TSS.base.

> I also don't remember why the TSS is 128 rather than 104 bytes.  The
> SDM claims that the TSS must be larger than 104 bytes "when accessing
> the I/O permission bit map or interrupt redirection bit map."
> (7.2.2. "TSS Descriptor") but I suspect that just means that the
> generated address of the bitmap must lie inside the limit.

The documented way of expressing "no IO bitmap" is to set the map base
to a value which exceeds the TSS limit.  All this means (I think) is
that you must make a larger than default TSS if you want to use a IO or
software interrupt bitmap.

> In any case, the limit set in vmx_set_segment_register() should surely
> match the size of the actual TSS.
> I haven't got the time or hardware to test this right now, but could
> maybe look at it next week unless anyone else wants to play with it.

I have hardware.  I will look into it when I have a moment, unless
anyone beats me to it.

~Andrew

Tim Deegan Jan. 27, 2017, 1:20 p.m. UTC | #10

Hi,

At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> On 27/01/17 11:14, Tim Deegan wrote:
> > But looking at it now, I'm not convinced of exactly how.  The magic
> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> > base address itself lives at offset 100.  A zero'd TSS should mean an
> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
> > plausibly work if the TSS were 256 bytes (matching the limit set in
> > Xen).  Perhaps it's only working because the 128 bytes following the
> > TSS in hvmloader happen to be zeros too?
> 
> With an IO_base_map of 0, the software interrupt bitmap will end up
> being ahead of the TSS, not after it.

I should have thought that the segmented address calculation would
wrap and leave us at TSS + 224.

> > I also don't remember why the TSS is 128 rather than 104 bytes.  The
> > SDM claims that the TSS must be larger than 104 bytes "when accessing
> > the I/O permission bit map or interrupt redirection bit map."
> > (7.2.2. "TSS Descriptor") but I suspect that just means that the
> > generated address of the bitmap must lie inside the limit.
> 
> The documented way of expressing "no IO bitmap" is to set the map base
> to a value which exceeds the TSS limit.  All this means (I think) is
> that you must make a larger than default TSS if you want to use a IO or
> software interrupt bitmap.

Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
enough space for a full one, but the current SDM is pretty clear that
the CPU will try to check it in virtual 8086 mode.

It may be that all the ports actually used happen to fall in the 128
bytes of zeros that we provide.

Or possibly (both for this and the interrupt bitmap) we are causing
#GP and somehow ending up exiting-and-emulating.  But I don't see
quite what the path is for that.

Cheers,

Tim.

Andrew Cooper Jan. 27, 2017, 1:46 p.m. UTC | #11

On 27/01/17 13:20, Tim Deegan wrote:
> Hi,
>
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>>> But looking at it now, I'm not convinced of exactly how.  The magic
>>> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>>> base address itself lives at offset 100.  A zero'd TSS should mean an
>>> I/O map at 0, and an interrupt redirection bitmap at -32, which would
>>> plausibly work if the TSS were 256 bytes (matching the limit set in
>>> Xen).  Perhaps it's only working because the 128 bytes following the
>>> TSS in hvmloader happen to be zeros too?
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

As far as I am aware, this is the only case of a system descriptor
access which could end up negative (relative to base).  All IDT/GDT/LDT
accesses are sensibly bounded by the validity of their trigger conditions.

I'd expect microcode to calculate TSS.base + I/O base - 32 +
bit_of(vector) on the expectation that an OS actually wanting this to
work would have set it up properly.

The actual behaviour can be determined by putting the TSS on a page
boundary, making the previous frame non-readable via EPT, and seeing
whether an EPT violation occurs.  (I haven't yet got far enough in my
nested virt work for this to be an easy thing to configure, but it is
possible by manually clobbering unrestricted mode on a newer processor
and using HAP.)

>
>>> I also don't remember why the TSS is 128 rather than 104 bytes.  The
>>> SDM claims that the TSS must be larger than 104 bytes "when accessing
>>> the I/O permission bit map or interrupt redirection bit map."
>>> (7.2.2. "TSS Descriptor") but I suspect that just means that the
>>> generated address of the bitmap must lie inside the limit.
>> The documented way of expressing "no IO bitmap" is to set the map base
>> to a value which exceeds the TSS limit.  All this means (I think) is
>> that you must make a larger than default TSS if you want to use a IO or
>> software interrupt bitmap.
> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
>
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

With an offset of 0, we actually provide 256 bytes of zeros in the
bitmap within the TSS limit.

> Or possibly (both for this and the interrupt bitmap) we are causing
> #GP and somehow ending up exiting-and-emulating.  But I don't see
> quite what the path is for that.

We set IOPL to 3 as well as when entering vm86 to fake up real mode. 
This bypasses all I/O bitmap checks (a properly common to ring 3
protected tasks as well - See specifically 20.2.7 "Sensitive
Instructions"), which means the IN/OUT instructions end up directly at
the relevant vmexit case.

~Andrew

Tim Deegan Jan. 27, 2017, 2:01 p.m. UTC | #12

Hi,

At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> The actual behaviour can be determined by putting the TSS on a page
> boundary, making the previous frame non-readable via EPT, and seeing
> whether an EPT violation occurs.

Indeed.  Or likewise with normal pagetables. 

> > Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> > enough space for a full one, but the current SDM is pretty clear that
> > the CPU will try to check it in virtual 8086 mode.
> >
> > It may be that all the ports actually used happen to fall in the 128
> > bytes of zeros that we provide.
> 
> With an offset of 0, we actually provide 256 bytes of zeros in the
> bitmap within the TSS limit.

Sure, or at least 128 bytes of zeros and another 128 bytes of something.

> > Or possibly (both for this and the interrupt bitmap) we are causing
> > #GP and somehow ending up exiting-and-emulating.  But I don't see
> > quite what the path is for that.
> 
> We set IOPL to 3 as well as when entering vm86 to fake up real mode. 
> This bypasses all I/O bitmap checks (a properly common to ring 3
> protected tasks as well - See specifically 20.2.7 "Sensitive
> Instructions"), which means the IN/OUT instructions end up directly at
> the relevant vmexit case.

20.2.8.1 makes it clear that this is not the case -- in virtual 8086
mode all IN/OUT ops check the bitmap event with IOPL == CPL.

Tim.

Andrew Cooper Jan. 27, 2017, 2:35 p.m. UTC | #13

On 27/01/17 14:01, Tim Deegan wrote:
> Hi,
>
> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> The actual behaviour can be determined by putting the TSS on a page
>> boundary, making the previous frame non-readable via EPT, and seeing
>> whether an EPT violation occurs.
> Indeed.  Or likewise with normal pagetables. 
>
>>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>>> enough space for a full one, but the current SDM is pretty clear that
>>> the CPU will try to check it in virtual 8086 mode.
>>>
>>> It may be that all the ports actually used happen to fall in the 128
>>> bytes of zeros that we provide.
>> With an offset of 0, we actually provide 256 bytes of zeros in the
>> bitmap within the TSS limit.
> Sure, or at least 128 bytes of zeros and another 128 bytes of something.

That is a good point.  Nothing prevents a guest exiting vm86 mode, and
using a task switch to move to a new tss, which will cause Xen to write
state back into the vm86_tss, making it no longer a zeroed block of memory.

Despite being owned by the guest, this TSS is actually managed by Xen. 
It should be initialised to defaults each time Xen needs to use it on
behalf of the guest.

>>> Or possibly (both for this and the interrupt bitmap) we are causing
>>> #GP and somehow ending up exiting-and-emulating.  But I don't see
>>> quite what the path is for that.
>> We set IOPL to 3 as well as when entering vm86 to fake up real mode. 
>> This bypasses all I/O bitmap checks (a properly common to ring 3
>> protected tasks as well - See specifically 20.2.7 "Sensitive
>> Instructions"), which means the IN/OUT instructions end up directly at
>> the relevant vmexit case.
> 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> mode all IN/OUT ops check the bitmap event with IOPL == CPL.

Hmm.  Right you area, which explains why the TSS limit is greater than
0x67. 

If the emulation code were working correctly, the emulator should come
to the same conclusion as hardware and inject a #GP fault.  I suspect it
is more likely that RomBIOS doesn't use a port higher than we have
bitmap space for.

~Andrew

Jan Beulich Jan. 27, 2017, 3:11 p.m. UTC | #14

>>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>> >  static long __initdata dom0_min_nrpages;
>> >  static long __initdata dom0_max_nrpages = LONG_MAX;
>> >  
>> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> > +#define HVM_VM86_TSS_SIZE   128
>> 
>> I continue to be puzzled by this value. Why 128? I think this really
>> needs to be clarified in the comment.
> 
> Given the recent comments by Tim, and that this is starting to look like a can
> of worms, I would like to leave this as-is for the moment, on the grounds that
> it's what hvmloader does (I'm not saying it's right), and that this issue
> should be treated independently from this patch series.

Well, for the purpose of this patch it would be sufficient if the
comment referred to hvmloader. But then I think I saw you set the
TSS limit to 0x67, which is neither in line with the value above nor
- according to what Tim said (but I didn't check myself yet) - the
255 used in hvmloader. I.e. if you clone hvmloader code, all
aspects of it should match.

> Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
> IIRC I've tried that before (without unrestricted mode support) and it was
> working fine.

Now if that's the case, then why bother with the TSS?

>> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
>> >              avail -= max_pdx >> s;
>> >      }
>> >  
>> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
>> > +    need_paging = opt_dom0_shadow ||
>> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
>> > +                                                   !paging_mode_hap(d)));
>> 
>> What is the !paging_mode_hap() part good for? It's being taken care
>> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
>> make the distinction more obvious, I'd suggest
>> 
>>     need_paging = has_hvm_container_domain(d)
>>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>>                   : opt_dom0_shadow;
> 
> AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
> with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
> paging_mode_hap would be false. Maybe that's just an impossible combination in
> any case...

At least when running Xen itself virtualized, I wouldn't dare to assume
this is an impossible combination. However, I can't see how that case
would be handled any different by the original or the suggested
replacement expressions: need_paging would get set either way afaict.

>> > @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>> >              continue;
>> >          }
>> >  
>> > -        *entry_guest = *entry;
>> > -        pages = PFN_UP(entry_guest->size);
>> > +        /*
>> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
>> > +         * that's the minimum granularity of the 2nd stage translation. Since
>> > +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
>> > +         * order to prevent this code from getting out of sync.
>> > +         */
>> > +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
>> 
>> You definitely don't need to use _AC() in C code. But the whole thing
>> can anyway simply be
>> 
>>         start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
>> 
>> (albeit I'd like to note that if anything we'd have to be prepared
>> for page sizes > 4k, not smaller ones, and the whole idea of
>> PAGE_ORDER_4K breaks in that case).
> 
> Thanks, I will change as per your recommendation above, although I'm not sure
> what to do with the PAGE_ORDER_4K thing. Are you fine with leaving it like you
> suggest?

Yes, there's far more broken code in that case, and hence the remark
was in parentheses in an attempt to make clear it's really just a remark.

>> > +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
>> > +{
>> > +    p2m_type_t p2mt;
>> > +    uint32_t rc, *ident_pt;
>> > +    uint8_t *tss;
>> > +    mfn_t mfn;
>> > +    paddr_t gaddr;
>> > +    unsigned int i;
>> > +
>> > +    /*
>> > +     * Steal some space from the last found RAM region. One page will be
>> > +     * used for the identity page tables, and the remaining space for the
>> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
>> > +     * to PAGE_SIZE.
>> > +     */
>> > +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
>> > +    {
>> > +        printk("Unable to find memory to stash the identity map and TSS\n");
>> > +        return -ENOMEM;
>> > +    }
>> > +
>> > +    /*
>> > +     * Identity-map page table is required for running with CR0.PG=0
>> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
>> > +     * superpages.
>> > +     */
>> > +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>> > +                              &mfn, &p2mt, 0, &rc);
>> > +    if ( ident_pt == NULL )
>> > +    {
>> > +        printk("Unable to map identity page tables\n");
>> > +        return -ENOMEM;
>> > +    }
>> > +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
>> > +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
>> > +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
>> > +    unmap_domain_page(ident_pt);
>> > +    put_page(mfn_to_page(mfn_x(mfn)));
>> > +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
>> > +    gaddr += PAGE_SIZE;
>> > +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
>> 
>> This comes too late - the page table setup above also requires
>> page alignment (and with that, adding PAGE_SIZE would not break
>> the alignment requirement). Even more, the code below doesn't
>> strictly require page alignment, it only requires for the range to
>> not cross a page boundary.
> 
> Given the change that you requested in pvh_steal_ram, now the start of the
> memory area returned by it it's not going to be page-aligned, so I will have to
> perform the TSS setup first, and then the identity page tables.

Or simply pass the required alignment.

Jan

Roger Pau Monne Jan. 27, 2017, 4:04 p.m. UTC | #15

On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
> >>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> > On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> >> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
> >> >  static long __initdata dom0_min_nrpages;
> >> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >> >  
> >> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> >> > +#define HVM_VM86_TSS_SIZE   128
> >> 
> >> I continue to be puzzled by this value. Why 128? I think this really
> >> needs to be clarified in the comment.
> > 
> > Given the recent comments by Tim, and that this is starting to look like a can
> > of worms, I would like to leave this as-is for the moment, on the grounds that
> > it's what hvmloader does (I'm not saying it's right), and that this issue
> > should be treated independently from this patch series.
> 
> Well, for the purpose of this patch it would be sufficient if the
> comment referred to hvmloader. But then I think I saw you set the
> TSS limit to 0x67, which is neither in line with the value above nor

Hm, no, I'm not setting the limit anywhere here, this is done in
vmx_set_segment_register, and it's indeed set to 0xff which is wrong for
hvmloader too according to the conversation that's going on related to this
HVM_VM86_TSS_SIZE param.

> - according to what Tim said (but I didn't check myself yet) - the
> 255 used in hvmloader. I.e. if you clone hvmloader code, all
> aspects of it should match.
> 
> > Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
> > IIRC I've tried that before (without unrestricted mode support) and it was
> > working fine.
> 
> Now if that's the case, then why bother with the TSS?

It seems like it working was just luck, but I don't know all the details. Maybe
the emulator is somehow fixing this up when the TSS is corrupted/incorrect?

> >> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
> >> >              avail -= max_pdx >> s;
> >> >      }
> >> >  
> >> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> >> > +    need_paging = opt_dom0_shadow ||
> >> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> >> > +                                                   !paging_mode_hap(d)));
> >> 
> >> What is the !paging_mode_hap() part good for? It's being taken care
> >> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
> >> make the distinction more obvious, I'd suggest
> >> 
> >>     need_paging = has_hvm_container_domain(d)
> >>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
> >>                   : opt_dom0_shadow;
> > 
> > AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
> > with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
> > paging_mode_hap would be false. Maybe that's just an impossible combination in
> > any case...
> 
> At least when running Xen itself virtualized, I wouldn't dare to assume
> this is an impossible combination. However, I can't see how that case
> would be handled any different by the original or the suggested
> replacement expressions: need_paging would get set either way afaict.

Oh yes, sorry, my reply was to the "What is the !paging_mode_hap() part good
for?" question. I've changed setting need_paging as you suggested.

> > Given the change that you requested in pvh_steal_ram, now the start of the
> > memory area returned by it it's not going to be page-aligned, so I will have to
> > perform the TSS setup first, and then the identity page tables.
> 
> Or simply pass the required alignment.

Passing an alignment here would mean that pvh_steal_ram would have to return 2
pages in order to meet this alignment, and we would end up wasting memory.
Also, this is the only caller of pvh_steal_ram that requires alignment. This is
what I have after changing pvh_steal_ram to remove RAM from the end of the
region:

static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
{
    p2m_type_t p2mt;
    uint32_t rc, *ident_pt;
    uint8_t *tss;
    mfn_t mfn;
    paddr_t gaddr;

    /*
     * Steal some space from the last found RAM region. One page will be
     * used for the identity page tables, and the remaining space for the
     * VM86 TSS. Note that after this not all e820 regions will be aligned
     * to PAGE_SIZE.
     */
    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
    {
        printk("Unable to find memory to stash the identity map and TSS\n");
        return -ENOMEM;
    }

    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                         &mfn, &p2mt, 0, &rc);
    if ( tss )
    {
        memset(tss, 0, HVM_VM86_TSS_SIZE);
        unmap_domain_page(tss);
        put_page(mfn_to_page(mfn_x(mfn)));
        d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
    }
    else
        printk("Unable to map VM86 TSS area\n");

    gaddr += HVM_VM86_TSS_SIZE;
    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

    /*
     * Identity-map page table is required for running with CR0.PG=0
     * when using Intel EPT. Create a 32-bit non-PAE page directory of
     * superpages.
     */
    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                              &mfn, &p2mt, 0, &rc);
    if ( ident_pt == NULL )
    {
        printk("Unable to map identity page tables\n");
        return -ENOMEM;
    }
    write_32bit_pse_identmap(ident_pt);
    unmap_domain_page(ident_pt);
    put_page(mfn_to_page(mfn_x(mfn)));
    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;

    return 0;
}

Roger.

Jan Beulich Jan. 27, 2017, 4:29 p.m. UTC | #16

>>> On 27.01.17 at 17:04, <roger.pau@citrix.com> wrote:
> On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
>> >>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
>> > On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
>> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> >> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>> >> >  static long __initdata dom0_min_nrpages;
>> >> >  static long __initdata dom0_max_nrpages = LONG_MAX;
>> >> >  
>> >> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> >> > +#define HVM_VM86_TSS_SIZE   128
>> >> 
>> >> I continue to be puzzled by this value. Why 128? I think this really
>> >> needs to be clarified in the comment.
>> > 
>> > Given the recent comments by Tim, and that this is starting to look like a can
>> > of worms, I would like to leave this as-is for the moment, on the grounds that
>> > it's what hvmloader does (I'm not saying it's right), and that this issue
>> > should be treated independently from this patch series.
>> 
>> Well, for the purpose of this patch it would be sufficient if the
>> comment referred to hvmloader. But then I think I saw you set the
>> TSS limit to 0x67, which is neither in line with the value above nor
> 
> Hm, no, I'm not setting the limit anywhere here, this is done in
> vmx_set_segment_register,

Well, you do, in patch 8 (in pvh_setup_cpus()). But that's a different
TSS, so the limits are independent. It's just what I had in mind here.

> and it's indeed set to 0xff which is wrong for
> hvmloader too according to the conversation that's going on related to this
> HVM_VM86_TSS_SIZE param.

Right.

>> - according to what Tim said (but I didn't check myself yet) - the
>> 255 used in hvmloader. I.e. if you clone hvmloader code, all
>> aspects of it should match.
>> 
>> > Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
>> > IIRC I've tried that before (without unrestricted mode support) and it was
>> > working fine.
>> 
>> Now if that's the case, then why bother with the TSS?
> 
> It seems like it working was just luck, but I don't know all the details. Maybe
> the emulator is somehow fixing this up when the TSS is corrupted/incorrect?

I don't think so. Btw, why is the kernel dropping back into real mode
anyway? It's being started in protected mode after all.

>> > Given the change that you requested in pvh_steal_ram, now the start of the
>> > memory area returned by it it's not going to be page-aligned, so I will have to
>> > perform the TSS setup first, and then the identity page tables.
>> 
>> Or simply pass the required alignment.
> 
> Passing an alignment here would mean that pvh_steal_ram would have to return 2
> pages in order to meet this alignment, and we would end up wasting memory.
> Also, this is the only caller of pvh_steal_ram that requires alignment. This is
> what I have after changing pvh_steal_ram to remove RAM from the end of the
> region:
> 
> static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> {
>     p2m_type_t p2mt;
>     uint32_t rc, *ident_pt;
>     uint8_t *tss;
>     mfn_t mfn;
>     paddr_t gaddr;
> 
>     /*
>      * Steal some space from the last found RAM region. One page will be
>      * used for the identity page tables, and the remaining space for the
>      * VM86 TSS. Note that after this not all e820 regions will be aligned
>      * to PAGE_SIZE.
>      */
>     if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
>     {
>         printk("Unable to find memory to stash the identity map and TSS\n");
>         return -ENOMEM;
>     }
> 
>     tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>                          &mfn, &p2mt, 0, &rc);
>     if ( tss )
>     {
>         memset(tss, 0, HVM_VM86_TSS_SIZE);
>         unmap_domain_page(tss);
>         put_page(mfn_to_page(mfn_x(mfn)));
>         d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
>     }
>     else
>         printk("Unable to map VM86 TSS area\n");
> 
>     gaddr += HVM_VM86_TSS_SIZE;
>     ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

And this assert holds merely because, prior to this function running,
all E820 entries are page aligned? That's rather fragile then.
Considering that getting into here is going to be increasingly unlikely
going forward, I don't think we should be afraid of wasting a little
bit of memory here.

Jan

Jan Beulich Jan. 27, 2017, 4:40 p.m. UTC | #17

>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>> > But looking at it now, I'm not convinced of exactly how.  The magic
>> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>> > base address itself lives at offset 100.  A zero'd TSS should mean an
>> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
>> > plausibly work if the TSS were 256 bytes (matching the limit set in
>> > Xen).  Perhaps it's only working because the 128 bytes following the
>> > TSS in hvmloader happen to be zeros too?
>> 
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
> 
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

I don't think wrapping takes the limit value into account. It's all
linear address calculations, and as Andrew says the assumption
in microcode likely is that things will be set up properly by any
OS interested in using the interrupt bitmap.

>> > I also don't remember why the TSS is 128 rather than 104 bytes.  The
>> > SDM claims that the TSS must be larger than 104 bytes "when accessing
>> > the I/O permission bit map or interrupt redirection bit map."
>> > (7.2.2. "TSS Descriptor") but I suspect that just means that the
>> > generated address of the bitmap must lie inside the limit.
>> 
>> The documented way of expressing "no IO bitmap" is to set the map base
>> to a value which exceeds the TSS limit.  All this means (I think) is
>> that you must make a larger than default TSS if you want to use a IO or
>> software interrupt bitmap.
> 
> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
> 
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

I suppose so: This is precisely enough for the ISA port range.

So what we'll need to do then, as I understand it from the
discussion so far:

- vmx_set_segment_register() will need to set a correct limit
- vmx_set_segment_register() should initialize the TSS every
  time (including setting the I/O bitmap address to no lower
  than 32)
- hvmloader's init_vm86_tss() will need to allocate 160 bytes
  rather than 128 (and we should expose this number, so that
  Roger can also use it)

Perhaps we should even introduce a hypercall for hvmloader
to query the needed value, rather than exposing a hardcoded
number?

Jan

Andrew Cooper Jan. 27, 2017, 6:06 p.m. UTC | #18

On 27/01/17 16:40, Jan Beulich wrote:
>>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
>> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>>> On 27/01/17 11:14, Tim Deegan wrote:
>>>> But looking at it now, I'm not convinced of exactly how.  The magic
>>>> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>>>> base address itself lives at offset 100.  A zero'd TSS should mean an
>>>> I/O map at 0, and an interrupt redirection bitmap at -32, which would
>>>> plausibly work if the TSS were 256 bytes (matching the limit set in
>>>> Xen).  Perhaps it's only working because the 128 bytes following the
>>>> TSS in hvmloader happen to be zeros too?
>>> With an IO_base_map of 0, the software interrupt bitmap will end up
>>> being ahead of the TSS, not after it.
>> I should have thought that the segmented address calculation would
>> wrap and leave us at TSS + 224.
> I don't think wrapping takes the limit value into account. It's all
> linear address calculations, and as Andrew says the assumption
> in microcode likely is that things will be set up properly by any
> OS interested in using the interrupt bitmap.
>
>>>> I also don't remember why the TSS is 128 rather than 104 bytes.  The
>>>> SDM claims that the TSS must be larger than 104 bytes "when accessing
>>>> the I/O permission bit map or interrupt redirection bit map."
>>>> (7.2.2. "TSS Descriptor") but I suspect that just means that the
>>>> generated address of the bitmap must lie inside the limit.
>>> The documented way of expressing "no IO bitmap" is to set the map base
>>> to a value which exceeds the TSS limit.  All this means (I think) is
>>> that you must make a larger than default TSS if you want to use a IO or
>>> software interrupt bitmap.
>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>> enough space for a full one, but the current SDM is pretty clear that
>> the CPU will try to check it in virtual 8086 mode.
>>
>> It may be that all the ports actually used happen to fall in the 128
>> bytes of zeros that we provide.
> I suppose so: This is precisely enough for the ISA port range.
>
> So what we'll need to do then, as I understand it from the
> discussion so far:
>
> - vmx_set_segment_register() will need to set a correct limit
> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)
> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
>
> Perhaps we should even introduce a hypercall for hvmloader
> to query the needed value, rather than exposing a hardcoded
> number?

I suggest we remove all responsibility of managing this from hvmloader. 
The only thing hvmloader does is allocate space for it, and reserve it
in the E820.

It is conceptually related to IDENT_PT, although the IDENT_PT must be
allocated and filled in by the domain builder for the HVM guest to
function.  It would be cleaner for the domain builder to also allocate
an adjacent page for the VM86_TSS when it constructs the IDENT_PT.

All HVMLoader needs to do is read the two hvmparams and adjust the E820
table suitably.

Finally, the IO bitmap needs to be a fraction larger than 160 bytes.

From tools/firmware/rombios/rombios.h:

#define PANIC_PORT  0x400
#define PANIC_PORT2 0x401
#define INFO_PORT   0x402
#define DEBUG_PORT  0x403

which are just above the ISA range.  I'd also just allocate a full page
for it; no OS is going to bother trying to use fractions of a page
around an E820 reserved region.

~Andrew

Tim Deegan Jan. 27, 2017, 7:43 p.m. UTC | #19

> Despite being owned by the guest, this TSS is actually managed by
Xen.
> It should be initialised to defaults each time Xen needs to use it
on
> behalf of the guest.

At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
> On 27/01/17 14:01, Tim Deegan wrote:
> > Hi,
> >
> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> >> The actual behaviour can be determined by putting the TSS on a page
> >> boundary, making the previous frame non-readable via EPT, and seeing
> >> whether an EPT violation occurs.
> > Indeed.  Or likewise with normal pagetables. 
> >
> >>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> >>> enough space for a full one, but the current SDM is pretty clear that
> >>> the CPU will try to check it in virtual 8086 mode.
> >>>
> >>> It may be that all the ports actually used happen to fall in the 128
> >>> bytes of zeros that we provide.
> >> With an offset of 0, we actually provide 256 bytes of zeros in the
> >> bitmap within the TSS limit.
> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
> 
> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
> using a task switch to move to a new tss, which will cause Xen to write
> state back into the vm86_tss, making it no longer a zeroed block of memory.
> 
> Despite being owned by the guest, this TSS is actually managed by Xen. 
> It should be initialised to defaults each time Xen needs to use it on
> behalf of the guest.

But it's already in an E820 reserved block - if the guest overwrites
it (with a task switch or otherwise) it will break real-mode support,
but this is no worse than nobbling any other part of the BIOS state.

If we're making it non-zero, I can see an argument for having Xen init
the contents once (maybe when the HVM param is written?) so that it
matches what Xen expects of it.  But resetting it every time we use it
would be overkill.

> >> We set IOPL to 3 as well as when entering vm86 to fake up real mode. 
> >> This bypasses all I/O bitmap checks (a properly common to ring 3
> >> protected tasks as well - See specifically 20.2.7 "Sensitive
> >> Instructions"), which means the IN/OUT instructions end up directly at
> >> the relevant vmexit case.
> > 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> > mode all IN/OUT ops check the bitmap event with IOPL == CPL.
> 
> Hmm.  Right you area, which explains why the TSS limit is greater than
> 0x67. 
> 
> If the emulation code were working correctly, the emulator should come
> to the same conclusion as hardware and inject a #GP fault.

I don't think so -- the emulator is emulating actual real-mode, not
virtual 8086 mode, so it shouldn't fault on any IO port accesses.

Cheers,

Tim.

Tim Deegan Jan. 27, 2017, 7:48 p.m. UTC | #20

At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
> >>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> > At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> >> On 27/01/17 11:14, Tim Deegan wrote:
> >> > But looking at it now, I'm not convinced of exactly how.  The magic
> >> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> >> > base address itself lives at offset 100.  A zero'd TSS should mean an
> >> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
> >> > plausibly work if the TSS were 256 bytes (matching the limit set in
> >> > Xen).  Perhaps it's only working because the 128 bytes following the
> >> > TSS in hvmloader happen to be zeros too?
> >> 
> >> With an IO_base_map of 0, the software interrupt bitmap will end up
> >> being ahead of the TSS, not after it.
> > 
> > I should have thought that the segmented address calculation would
> > wrap and leave us at TSS + 224.
> 
> I don't think wrapping takes the limit value into account.

Quite right, I'm talking nonsense.

> - vmx_set_segment_register() will need to set a correct limit

Yep.

> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)

Probably to no lower than 136, to avoid having the bits of that field
itself appearing in either the IO or interrupt bitmap.

> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
> 
> Perhaps we should even introduce a hypercall for hvmloader
> to query the needed value, rather than exposing a hardcoded
> number?

I think Andrew's suggestion of just using a whole page is a good
one.  The TSS is a 32-bit one, after all, and doesn't need to live in
BIOS space.

Cheers,

Tim.

Jan Beulich Jan. 30, 2017, 10:43 a.m. UTC | #21

>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:

>> Despite being owned by the guest, this TSS is actually managed by
> Xen.
>> It should be initialised to defaults each time Xen needs to use it
> on
>> behalf of the guest.
> 
> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>> On 27/01/17 14:01, Tim Deegan wrote:
>> > Hi,
>> >
>> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> >> The actual behaviour can be determined by putting the TSS on a page
>> >> boundary, making the previous frame non-readable via EPT, and seeing
>> >> whether an EPT violation occurs.
>> > Indeed.  Or likewise with normal pagetables. 
>> >
>> >>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>> >>> enough space for a full one, but the current SDM is pretty clear that
>> >>> the CPU will try to check it in virtual 8086 mode.
>> >>>
>> >>> It may be that all the ports actually used happen to fall in the 128
>> >>> bytes of zeros that we provide.
>> >> With an offset of 0, we actually provide 256 bytes of zeros in the
>> >> bitmap within the TSS limit.
>> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>> 
>> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
>> using a task switch to move to a new tss, which will cause Xen to write
>> state back into the vm86_tss, making it no longer a zeroed block of memory.
>> 
>> Despite being owned by the guest, this TSS is actually managed by Xen. 
>> It should be initialised to defaults each time Xen needs to use it on
>> behalf of the guest.
> 
> But it's already in an E820 reserved block - if the guest overwrites
> it (with a task switch or otherwise) it will break real-mode support,
> but this is no worse than nobbling any other part of the BIOS state.
> 
> If we're making it non-zero, I can see an argument for having Xen init
> the contents once (maybe when the HVM param is written?) so that it
> matches what Xen expects of it.  But resetting it every time we use it
> would be overkill.

That wasn't the point Andrew was making, I think. A task switch
initiated by the guest would make the hypervisor write into that
TSS (as the outgoing one). Of course any sane guest would do an
LTR first (or else it would risk memory near address zero to get
clobbered on real hardware).

Jan

Andrew Cooper Jan. 30, 2017, 11:06 a.m. UTC | #22

On 30/01/17 10:43, Jan Beulich wrote:
>>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:
>>> Despite being owned by the guest, this TSS is actually managed by
>> Xen.
>>> It should be initialised to defaults each time Xen needs to use it
>> on
>>> behalf of the guest.
>> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>>> On 27/01/17 14:01, Tim Deegan wrote:
>>>> Hi,
>>>>
>>>> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>>>>> The actual behaviour can be determined by putting the TSS on a page
>>>>> boundary, making the previous frame non-readable via EPT, and seeing
>>>>> whether an EPT violation occurs.
>>>> Indeed.  Or likewise with normal pagetables. 
>>>>
>>>>>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>>>>>> enough space for a full one, but the current SDM is pretty clear that
>>>>>> the CPU will try to check it in virtual 8086 mode.
>>>>>>
>>>>>> It may be that all the ports actually used happen to fall in the 128
>>>>>> bytes of zeros that we provide.
>>>>> With an offset of 0, we actually provide 256 bytes of zeros in the
>>>>> bitmap within the TSS limit.
>>>> Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>>> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
>>> using a task switch to move to a new tss, which will cause Xen to write
>>> state back into the vm86_tss, making it no longer a zeroed block of memory.
>>>
>>> Despite being owned by the guest, this TSS is actually managed by Xen. 
>>> It should be initialised to defaults each time Xen needs to use it on
>>> behalf of the guest.
>> But it's already in an E820 reserved block - if the guest overwrites
>> it (with a task switch or otherwise) it will break real-mode support,
>> but this is no worse than nobbling any other part of the BIOS state.
>>
>> If we're making it non-zero, I can see an argument for having Xen init
>> the contents once (maybe when the HVM param is written?) so that it
>> matches what Xen expects of it.  But resetting it every time we use it
>> would be overkill.
> That wasn't the point Andrew was making, I think. A task switch
> initiated by the guest would make the hypervisor write into that
> TSS (as the outgoing one). Of course any sane guest would do an
> LTR first (or else it would risk memory near address zero to get
> clobbered on real hardware).

Thinking about it, this depends on whether we properly save and restore
the protected mode %tr around entering and exiting faked-up real mode.

If the saving and restoring is already done properly, then I think my
concern is unfounded.

~Andrew

Jan Beulich Feb. 2, 2017, 3:38 p.m. UTC | #23

>>> On 27.01.17 at 20:48, <tim@xen.org> wrote:
> At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
> 
> Probably to no lower than 136, to avoid having the bits of that field
> itself appearing in either the IO or interrupt bitmap.

Indeed.

>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>> 
>> Perhaps we should even introduce a hypercall for hvmloader
>> to query the needed value, rather than exposing a hardcoded
>> number?
> 
> I think Andrew's suggestion of just using a whole page is a good
> one.  The TSS is a 32-bit one, after all, and doesn't need to live in
> BIOS space.

Hmm, any size increase will need to come with further changes,
as it looks, including the use of a new HVM param: The VM86_TSS
param is being migrated, and hence for an incoming VM we need
to be able to tell whether the guest has set aside 128 bytes or a
full page. This of course implies that we need to keep Xen handle
the 128-byte case, too.

And if we somehow expect that a single page may not suffice in
the future, it may even be advisable to store an (address,size)
pair as param.

Jan

Jan Beulich Feb. 3, 2017, 1:57 p.m. UTC | #24

>>> On 27.01.17 at 19:06, <andrew.cooper3@citrix.com> wrote:
> On 27/01/17 16:40, Jan Beulich wrote:
>> So what we'll need to do then, as I understand it from the
>> discussion so far:
>>
>> - vmx_set_segment_register() will need to set a correct limit
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>>
>> Perhaps we should even introduce a hypercall for hvmloader
>> to query the needed value, rather than exposing a hardcoded
>> number?
> 
> I suggest we remove all responsibility of managing this from hvmloader. 
> The only thing hvmloader does is allocate space for it, and reserve it
> in the E820.

While I did it that way for now, I'm no longer convinced this is
useful. With multiple vCPU-s, a guest can do whatever it wants to
this TSS anyway, regardless of whether Xen currently thinks it's
using a suitably initialized memory block. And whatever the guest
does, any non-zero bit in that area will only slow it down (due to
the VM exits resulting from the #GP faults caused by those 1 bits,
resulting in the respective I/O or INTnn insns being carried out by
the emulator).

> It is conceptually related to IDENT_PT, although the IDENT_PT must be
> allocated and filled in by the domain builder for the HVM guest to
> function.  It would be cleaner for the domain builder to also allocate
> an adjacent page for the VM86_TSS when it constructs the IDENT_PT.

I'll leave that for someone else to carry out; for now allocation
will remain in hvmloader.

> Finally, the IO bitmap needs to be a fraction larger than 160 bytes.
> 
> From tools/firmware/rombios/rombios.h:
> 
> #define PANIC_PORT  0x400
> #define PANIC_PORT2 0x401
> #define INFO_PORT   0x402
> #define DEBUG_PORT  0x403
> 
> which are just above the ISA range.

Which causes only slowness (due to needing the emulator to carry
out the instruction), but no lack of functionality.

>  I'd also just allocate a full page
> for it; no OS is going to bother trying to use fractions of a page
> around an E820 reserved region.

But the smaller range may well be part of an already partially used
page. Together with the fact that any port accesses not covered
by the bitmap would still be correctly handled, I'd prefer to make
the TSS 0x68 + 0x20 + 0x80 + 1 bytes large (base structure plus
interrupt redirection bitmap plus I/O bitmap plus trailing byte),
which, due to the goal of avoiding page boundaries in the middle,
would mean a 512 byte block aligned to a 512-byte boundary.

Jan

[v5,4/9] xen/x86: populate PVHv2 Dom0 physical memory map

Commit Message

Comments

Patch