
[6/6] x86/hvm: Implement hvmemul_write() using real mappings

Message ID 1498057952-13556-7-git-send-email-andrew.cooper3@citrix.com (mailing list archive)
State New, archived

Commit Message

Andrew Cooper June 21, 2017, 3:12 p.m. UTC
An access which crosses a page boundary is performed atomically by x86
hardware, albeit with a severe performance penalty.  An important corner case
is when a straddled access hits two pages which differ in whether a
translation exists, or in net access rights.

The use of hvm_copy*() in hvmemul_write() is problematic, because it performs
a translation then completes the partial write, before moving onto the next
translation.

If an individual emulated write straddles two pages, the first of which is
writable, and the second of which is not, the first half of the write will
complete before #PF is raised from the second half.

This results in guest state corruption as a side effect of emulation, which
has been observed to cause Windows to crash while under introspection.

Introduce the hvmemul_{,un}map_linear_addr() helpers, which translate the
entire contents of a linear access, and vmap() the underlying frames to
provide a contiguous virtual mapping for the emulator to use.  This is the
same mechanism as used by the shadow emulation code.

This will catch any translation issues and abort the emulation before any
modifications occur.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: Razvan Cojocaru <rcojocaru@bitdefender.com>
CC: Mihai Donțu <mdontu@bitdefender.com>

While the maximum size of a linear mapping is capped at 1 page, the logic in the
helpers is written to work properly as hvmemul_ctxt->mfn[] gets longer,
specifically with XSAVE instruction emulation in mind.

This has only had light testing so far.
---
 xen/arch/x86/hvm/emulate.c        | 179 ++++++++++++++++++++++++++++++++++----
 xen/include/asm-x86/hvm/emulate.h |   7 ++
 2 files changed, 169 insertions(+), 17 deletions(-)

Comments

Paul Durrant June 21, 2017, 4:19 p.m. UTC | #1
> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: 21 June 2017 16:13
> To: Xen-devel <xen-devel@lists.xen.org>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> <JBeulich@suse.com>; Paul Durrant <Paul.Durrant@citrix.com>; Razvan
> Cojocaru <rcojocaru@bitdefender.com>; Mihai Donțu
> <mdontu@bitdefender.com>
> Subject: [PATCH 6/6] x86/hvm: Implement hvmemul_write() using real
> mappings
>
> +static void *hvmemul_map_linear_addr(
> +    unsigned long linear, unsigned int bytes, uint32_t pfec,
> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +{
> +    struct vcpu *curr = current;
> +    void *err, *mapping;
> +
> +    /* First and final gfns which need mapping. */
> +    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
> +    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;

Do we need to worry about linear + bytes overflowing here?

Also, is this ever legitimately called with bytes == 0?

> +
> +    /*
> +     * mfn points to the next free slot.  All used slots have a page reference
> +     * held on them.
> +     */
> +    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
> +
> +    /*
> +     * The caller has no legitimate reason for trying a zero-byte write, but
> +     * final is calculated to fail safe in release builds.
> +     *
> +     * The maximum write size depends on the number of adjacent mfns[] which
> +     * can be vmap()'d, accounting for possible misalignment within the region.
> +     * The higher level emulation callers are responsible for ensuring that
> +     * mfns[] is large enough for the requested write size.
> +     */
> +    if ( bytes == 0 ||
> +         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )
> +    {

I guess not, so why the weird-looking calculation for final? Its value will not be used when bytes == 0.

> +        ASSERT_UNREACHABLE();
> +        goto unhandleable;
> +    }

[...]

> +static void hvmemul_unmap_linear_addr(
> +    void *mapping, unsigned long linear, unsigned int bytes,
> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +{
> +    struct domain *currd = current->domain;
> +    unsigned long frame = linear >> PAGE_SHIFT;
> +    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
> +    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
> +
> +    ASSERT(bytes > 0);

Why not return if bytes == 0? I know it's not a legitimate call but in a non-debug build it would result in unmap_domain_page() being called below.

  Paul

Jan Beulich June 22, 2017, 9:06 a.m. UTC | #2
>>> On 21.06.17 at 17:12, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -498,6 +498,159 @@ static int hvmemul_do_mmio_addr(paddr_t mmio_gpa,
>  }
>  
>  /*
> + * Map the frame(s) covering an individual linear access, for writeable
> + * access.  May return NULL for MMIO, or ERR_PTR(~X86EMUL_*) for other errors
> + * including ERR_PTR(~X86EMUL_OKAY) for write-discard mappings.
> + *
> + * In debug builds, map() checks that each slot in hvmemul_ctxt->mfn[] is
> + * clean before use, and poisons unused slots with INVALID_MFN.
> + */
> +static void *hvmemul_map_linear_addr(
> +    unsigned long linear, unsigned int bytes, uint32_t pfec,
> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +{
> +    struct vcpu *curr = current;
> +    void *err, *mapping;
> +
> +    /* First and final gfns which need mapping. */
> +    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
> +    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;

I second Paul's desire for the zero-byte case to be rejected up
the call stack.

> +    /*
> +     * mfn points to the next free slot.  All used slots have a page reference
> +     * held on them.
> +     */
> +    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
> +
> +    /*
> +     * The caller has no legitimate reason for trying a zero-byte write, but
> +     * final is calculated to fail safe in release builds.
> +     *
> +     * The maximum write size depends on the number of adjacent mfns[] which
> +     * can be vmap()'d, accounting for possible misalignment within the region.
> +     * The higher level emulation callers are responsible for ensuring that
> +     * mfns[] is large enough for the requested write size.
> +     */
> +    if ( bytes == 0 ||
> +         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )

Any reason not to use ">= ARRAY_SIZE(hvmemul_ctxt->mfn)"?

> +    {
> +        ASSERT_UNREACHABLE();
> +        goto unhandleable;
> +    }
> +
> +    do {
> +        enum hvm_translation_result res;
> +        struct page_info *page;
> +        pagefault_info_t pfinfo;
> +        p2m_type_t p2mt;
> +
> +        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
> +                                     &pfinfo, &page, NULL, &p2mt);
> +
> +        switch ( res )
> +        {
> +        case HVMTRANS_okay:
> +            break;
> +
> +        case HVMTRANS_bad_linear_to_gfn:
> +            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
> +            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
> +            goto out;
> +
> +        case HVMTRANS_bad_gfn_to_mfn:
> +            err = NULL;
> +            goto out;
> +
> +        case HVMTRANS_gfn_paged_out:
> +        case HVMTRANS_gfn_shared:
> +            err = ERR_PTR(~(long)X86EMUL_RETRY);
> +            goto out;
> +
> +        default:
> +            goto unhandleable;
> +        }
> +
> +        /* Error checking.  Confirm that the current slot is clean. */
> +        ASSERT(mfn_x(*mfn) == 0);

Wouldn't this better be done first thing in the loop? And wouldn't
the value better be INVALID_MFN?

> +        *mfn++ = _mfn(page_to_mfn(page));
> +        frame++;
> +
> +        if ( p2m_is_discard_write(p2mt) )
> +        {
> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
> +            goto out;

If one page is discard-write and the other isn't, this will end up
being wrong.

> +        }
> +
> +    } while ( frame < final );
> +
> +    /* Entire access within a single frame? */
> +    if ( first == final )
> +        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
> +    /* Multiple frames? Need to vmap(). */
> +    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
> +                              mfn - hvmemul_ctxt->mfn)) == NULL )

final - first + 1 would likely yield better code.

> +        goto unhandleable;
> +
> +#ifndef NDEBUG /* Poison unused mfn[]s with INVALID_MFN. */
> +    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
> +    {
> +        ASSERT(mfn_x(*mfn) == 0);
> +        *mfn++ = INVALID_MFN;
> +    }
> +#endif
> +
> +    return mapping;
> +
> + unhandleable:
> +    err = ERR_PTR(~(long)X86EMUL_UNHANDLEABLE);
> +
> + out:
> +    /* Drop all held references. */
> +    while ( mfn > hvmemul_ctxt->mfn )
> +        put_page(mfn_to_page(mfn_x(*mfn--)));

ITYM

    while ( mfn-- > hvmemul_ctxt->mfn )
        put_page(mfn_to_page(mfn_x(*mfn)));

or

    while ( mfn > hvmemul_ctxt->mfn )
        put_page(mfn_to_page(mfn_x(*--mfn)));

> +static void hvmemul_unmap_linear_addr(
> +    void *mapping, unsigned long linear, unsigned int bytes,

Both vunmap() and unmap_domain_page() take pointers to const, so
please use const on the pointer here too.

> +    struct hvm_emulate_ctxt *hvmemul_ctxt)

There are upsides and downsides to requiring the caller to pass in the
same values as to map(): You can do more correctness checking
here, but you also risk the caller using the wrong values (perhaps
because of a meanwhile updated local variable). While I don't
outright object to this approach, personally I'd prefer minimal
inputs here, and the code deriving everything from hvmemul_ctxt.

> @@ -1007,23 +1160,15 @@ static int hvmemul_write(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
>  
> -    rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
> -
> -    switch ( rc )
> -    {
> -    case HVMTRANS_okay:
> -        break;
> -    case HVMTRANS_bad_linear_to_gfn:
> -        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
> -        return X86EMUL_EXCEPTION;
> -    case HVMTRANS_bad_gfn_to_mfn:
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    if ( IS_ERR(mapping) )
> +        return ~PTR_ERR(mapping);
> +    else if ( !mapping )

Pointless "else".

>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 0);

Considering the 2nd linear -> guest-phys translation done here, did
you consider having hvmemul_map_linear_addr() obtain and provide
the GFNs?

> --- a/xen/include/asm-x86/hvm/emulate.h
> +++ b/xen/include/asm-x86/hvm/emulate.h
> @@ -37,6 +37,13 @@ struct hvm_emulate_ctxt {
>      unsigned long seg_reg_accessed;
>      unsigned long seg_reg_dirty;
>  
> +    /*
> +     * MFNs behind temporary mappings in the write callback.  The length is
> +     * arbitrary, and can be increased if writes longer than PAGE_SIZE are
> +     * needed.
> +     */
> +    mfn_t mfn[2];

Mind being precise in the comment, saying "PAGE_SIZE+1"?

Jan
Andrew Cooper July 3, 2017, 3:07 p.m. UTC | #3
On 22/06/17 10:06, Jan Beulich wrote:
>
>> +    /*
>> +     * mfn points to the next free slot.  All used slots have a page reference
>> +     * held on them.
>> +     */
>> +    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
>> +
>> +    /*
>> +     * The caller has no legitimate reason for trying a zero-byte write, but
>> +     * final is calculated to fail safe in release builds.
>> +     *
>> +     * The maximum write size depends on the number of adjacent mfns[] which
>> +     * can be vmap()'d, accounting for possible misalignment within the region.
>> +     * The higher level emulation callers are responsible for ensuring that
>> +     * mfns[] is large enough for the requested write size.
>> +     */
>> +    if ( bytes == 0 ||
>> +         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )
> Any reason not to use ">= ARRAY_SIZE(hvmemul_ctxt->mfn)"

It more obviously identifies the accounting for misalignment.
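
As a standalone illustration of the arithmetic (assuming 4KiB pages; the
address and length are made up for the example), a maximally misaligned
PAGE_SIZE write needs exactly two frames, which is what the
"final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1" check permits:

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
    /* Hypothetical worst case: a 4096-byte write starting one byte into
     * a page, so it spills into the next one. */
    unsigned long linear = 0x10001UL;
    unsigned int bytes = 4096;

    unsigned long first = linear >> PAGE_SHIFT;
    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;

    /* Prints "first 0x10, final 0x11, frames 2": two mfn[] slots cover
     * any write of up to PAGE_SIZE bytes, however it is aligned. */
    printf("first %#lx, final %#lx, frames %lu\n",
           first, final, final - first + 1);

    return 0;
}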

>
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        goto unhandleable;
>> +    }
>> +
>> +    do {
>> +        enum hvm_translation_result res;
>> +        struct page_info *page;
>> +        pagefault_info_t pfinfo;
>> +        p2m_type_t p2mt;
>> +
>> +        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
>> +                                     &pfinfo, &page, NULL, &p2mt);
>> +
>> +        switch ( res )
>> +        {
>> +        case HVMTRANS_okay:
>> +            break;
>> +
>> +        case HVMTRANS_bad_linear_to_gfn:
>> +            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>> +            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
>> +            goto out;
>> +
>> +        case HVMTRANS_bad_gfn_to_mfn:
>> +            err = NULL;
>> +            goto out;
>> +
>> +        case HVMTRANS_gfn_paged_out:
>> +        case HVMTRANS_gfn_shared:
>> +            err = ERR_PTR(~(long)X86EMUL_RETRY);
>> +            goto out;
>> +
>> +        default:
>> +            goto unhandleable;
>> +        }
>> +
>> +        /* Error checking.  Confirm that the current slot is clean. */
>> +        ASSERT(mfn_x(*mfn) == 0);
> Wouldn't this better be done first thing in the loop?

IMO it's clearer to keep it next to the assignment, but yes, it could in
principle move to the top of the loop.

> And wouldn't the value better be INVALID_MFN?

The backing array is zeroed by hvm_emulate_init_once(), so relying on 0
here is more convenient.

Furthermore, INVALID_MFN is used lower down to poison unused slots, so
initialising the whole array to INVALID_MFN reduces the effectiveness of
the checks.

>
>> +        *mfn++ = _mfn(page_to_mfn(page));
>> +        frame++;
>> +
>> +        if ( p2m_is_discard_write(p2mt) )
>> +        {
>> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
>> +            goto out;
> If one page is discard-write and the other isn't, this will end up
> being wrong.

Straddled accesses are always a grey area, and discard-write is an extra
special case which only exists inside Xen.  Discard-write means that the
guest knows that it shouldn't write there at all.

Doing nothing (by logically extending the discard-write restriction over
the entire region) is the least bad option here, IMO.

>> +        goto unhandleable;
>> +
>> +#ifndef NDEBUG /* Poison unused mfn[]s with INVALID_MFN. */
>> +    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
>> +    {
>> +        ASSERT(mfn_x(*mfn) == 0);
>> +        *mfn++ = INVALID_MFN;
>> +    }
>> +#endif
>> +
>> +    return mapping;
>> +
>> + unhandleable:
>> +    err = ERR_PTR(~(long)X86EMUL_UNHANDLEABLE);
>> +
>> + out:
>> +    /* Drop all held references. */
>> +    while ( mfn > hvmemul_ctxt->mfn )
>> +        put_page(mfn_to_page(mfn_x(*mfn--)));
> ITYM
>
>     while ( mfn-- > hvmemul_ctxt->mfn )
>         put_page(mfn_to_page(mfn_x(*mfn)));
>
> or
>
>     while ( mfn > hvmemul_ctxt->mfn )
>         put_page(mfn_to_page(mfn_x(*--mfn)));

Yes, I do.
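
To make the off-by-one concrete, a minimal sketch (plain ints standing in
for the mfn[] array and put_page(), and a printf in place of dropping the
reference):

#include <stdio.h>

static void put(long idx, int val)
{
    printf("dropping slot %ld (value %d)\n", idx, val);
}

int main(void)
{
    int refs[3] = { 10, 11, 0 };     /* two used slots, one free slot */
    int *base = refs, *p;

    /* Original loop: p points at the next free slot (refs[2]). */
    p = base + 2;
    while ( p > base )
    {
        put(p - base, *p);           /* first pass reads the free slot */
        p--;
    }
    /* Prints slots 2 and 1; refs[0] is never dropped. */

    /* Suggested variant: decrement before dereferencing. */
    p = base + 2;
    while ( p-- > base )
        put(p - base, *p);
    /* Prints slots 1 and 0, as intended. */

    return 0;
}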

>
>> +static void hvmemul_unmap_linear_addr(
>> +    void *mapping, unsigned long linear, unsigned int bytes,
> Both vunmap() and unmap_domain_page() take pointers to const, so
> please use const on the pointer here too.

The meaning of const void *p in C is "this function does not modify the
content pointed to by p".

Both vunmap() and unmap_domain_page() mutate the content being pointed
to, so should not take const pointers.

>
>> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> There upsides and downsides to requiring the caller to pass in the
> same values as to map(): You can do more correctness checking
> here, but you also risk the caller using the wrong values (perhaps
> because of a meanwhile updated local variable). While I don't
> outright object to this approach, personally I'd prefer minimal
> inputs here, and the code deriving everything from hvmemul_ctxt.

I'm not sure exactly how we might wish to extend this logic.  Are we
ever going to want more than one active mapping at once (perhaps rep
movs emulation across two ram regions)?

The other reason is that in the release builds, even if we stored the
pointer in hvmemul_ctxt, we still cant determine which unmapping
function to use without linear and size.

>
>> @@ -1007,23 +1160,15 @@ static int hvmemul_write(
>>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
>>  
>> -    rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
>> -
>> -    switch ( rc )
>> -    {
>> -    case HVMTRANS_okay:
>> -        break;
>> -    case HVMTRANS_bad_linear_to_gfn:
>> -        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>> -        return X86EMUL_EXCEPTION;
>> -    case HVMTRANS_bad_gfn_to_mfn:
>> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
>> +    if ( IS_ERR(mapping) )
>> +        return ~PTR_ERR(mapping);
>> +    else if ( !mapping )
> Pointless "else".
>
>>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 0);
> Considering the 2nd linear -> guest-phys translation done here, did
> you consider having hvmemul_map_linear_addr() obtain and provide
> the GFNs?

Paul has some plans which should remove this second entry into the MMIO
handling code.

>
>> --- a/xen/include/asm-x86/hvm/emulate.h
>> +++ b/xen/include/asm-x86/hvm/emulate.h
>> @@ -37,6 +37,13 @@ struct hvm_emulate_ctxt {
>>      unsigned long seg_reg_accessed;
>>      unsigned long seg_reg_dirty;
>>  
>> +    /*
>> +     * MFNs behind temporary mappings in the write callback.  The length is
>> +     * arbitrary, and can be increased if writes longer than PAGE_SIZE are
>> +     * needed.
>> +     */
>> +    mfn_t mfn[2];
> Mind being precise in the comment, saying "PAGE_SIZE+1"?

While that is strictly true, it is not the behaviour which the map()
function takes.  I don't think it is worth the overhead of fixing that
boundary condition for one extra byte, at which point the documentation
should match the implementation.

~Andrew
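
For reference, a minimal standalone sketch of the ERR_PTR(~X86EMUL_*)
convention used throughout the series, with Linux-style stand-ins for
Xen's <xen/err.h> helpers and made-up X86EMUL_* values:

#include <stdio.h>

#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Illustrative values only; the real enum lives in x86_emulate.h. */
enum { X86EMUL_OKAY, X86EMUL_UNHANDLEABLE, X86EMUL_EXCEPTION, X86EMUL_RETRY };

int main(void)
{
    /* ~X86EMUL_RETRY is a small negative number, so the encoded pointer
     * lands inside the IS_ERR() range and cannot collide with a real
     * mapping or with the NULL used for MMIO. */
    void *err = ERR_PTR(~(long)X86EMUL_RETRY);

    /* Decoding mirrors hvmemul_write(): return ~PTR_ERR(mapping); */
    printf("IS_ERR=%d, decoded rc=%ld (X86EMUL_RETRY=%d)\n",
           IS_ERR(err), ~PTR_ERR(err), X86EMUL_RETRY);

    return 0;
}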

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 384ad0b..804bea5 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -498,6 +498,159 @@  static int hvmemul_do_mmio_addr(paddr_t mmio_gpa,
 }
 
 /*
+ * Map the frame(s) covering an individual linear access, for writeable
+ * access.  May return NULL for MMIO, or ERR_PTR(~X86EMUL_*) for other errors
+ * including ERR_PTR(~X86EMUL_OKAY) for write-discard mappings.
+ *
+ * In debug builds, map() checks that each slot in hvmemul_ctxt->mfn[] is
+ * clean before use, and poisons unused slots with INVALID_MFN.
+ */
+static void *hvmemul_map_linear_addr(
+    unsigned long linear, unsigned int bytes, uint32_t pfec,
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    struct vcpu *curr = current;
+    void *err, *mapping;
+
+    /* First and final gfns which need mapping. */
+    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
+    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
+
+    /*
+     * mfn points to the next free slot.  All used slots have a page reference
+     * held on them.
+     */
+    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
+
+    /*
+     * The caller has no legitimate reason for trying a zero-byte write, but
+     * final is calculated to fail safe in release builds.
+     *
+     * The maximum write size depends on the number of adjacent mfns[] which
+     * can be vmap()'d, accounting for possible misalignment within the region.
+     * The higher level emulation callers are responsible for ensuring that
+     * mfns[] is large enough for the requested write size.
+     */
+    if ( bytes == 0 ||
+         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )
+    {
+        ASSERT_UNREACHABLE();
+        goto unhandleable;
+    }
+
+    do {
+        enum hvm_translation_result res;
+        struct page_info *page;
+        pagefault_info_t pfinfo;
+        p2m_type_t p2mt;
+
+        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
+                                     &pfinfo, &page, NULL, &p2mt);
+
+        switch ( res )
+        {
+        case HVMTRANS_okay:
+            break;
+
+        case HVMTRANS_bad_linear_to_gfn:
+            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
+            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
+            goto out;
+
+        case HVMTRANS_bad_gfn_to_mfn:
+            err = NULL;
+            goto out;
+
+        case HVMTRANS_gfn_paged_out:
+        case HVMTRANS_gfn_shared:
+            err = ERR_PTR(~(long)X86EMUL_RETRY);
+            goto out;
+
+        default:
+            goto unhandleable;
+        }
+
+        /* Error checking.  Confirm that the current slot is clean. */
+        ASSERT(mfn_x(*mfn) == 0);
+
+        *mfn++ = _mfn(page_to_mfn(page));
+        frame++;
+
+        if ( p2m_is_discard_write(p2mt) )
+        {
+            err = ERR_PTR(~(long)X86EMUL_OKAY);
+            goto out;
+        }
+
+    } while ( frame < final );
+
+    /* Entire access within a single frame? */
+    if ( first == final )
+        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
+    /* Multiple frames? Need to vmap(). */
+    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
+                              mfn - hvmemul_ctxt->mfn)) == NULL )
+        goto unhandleable;
+
+#ifndef NDEBUG /* Poison unused mfn[]s with INVALID_MFN. */
+    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
+    {
+        ASSERT(mfn_x(*mfn) == 0);
+        *mfn++ = INVALID_MFN;
+    }
+#endif
+
+    return mapping;
+
+ unhandleable:
+    err = ERR_PTR(~(long)X86EMUL_UNHANDLEABLE);
+
+ out:
+    /* Drop all held references. */
+    while ( mfn > hvmemul_ctxt->mfn )
+        put_page(mfn_to_page(mfn_x(*mfn--)));
+
+    return err;
+}
+
+static void hvmemul_unmap_linear_addr(
+    void *mapping, unsigned long linear, unsigned int bytes,
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    struct domain *currd = current->domain;
+    unsigned long frame = linear >> PAGE_SHIFT;
+    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
+    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
+
+    ASSERT(bytes > 0);
+
+    if ( frame == final )
+        unmap_domain_page(mapping);
+    else
+        vunmap(mapping);
+
+    do
+    {
+        ASSERT(mfn_valid(*mfn));
+        paging_mark_dirty(currd, *mfn);
+        put_page(mfn_to_page(mfn_x(*mfn)));
+
+        frame++;
+        *mfn++ = _mfn(0); /* Clean slot for map()'s error checking. */
+
+    } while ( frame < final );
+
+
+#ifndef NDEBUG /* Check (and clean) all unused mfns. */
+    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
+    {
+        ASSERT(mfn_eq(*mfn, INVALID_MFN));
+        *mfn++ = _mfn(0);
+    }
+#endif
+}
+
+/*
  * Convert addr from linear to physical form, valid over the range
  * [addr, addr + *reps * bytes_per_rep]. *reps is adjusted according to
  * the valid computed range. It is always >0 when X86EMUL_OKAY is returned.
@@ -987,11 +1140,11 @@  static int hvmemul_write(
     struct hvm_emulate_ctxt *hvmemul_ctxt =
         container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
     struct vcpu *curr = current;
-    pagefault_info_t pfinfo;
     unsigned long addr, reps = 1;
     uint32_t pfec = PFEC_page_present | PFEC_write_access;
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     int rc;
+    void *mapping;
 
     if ( is_x86_system_segment(seg) )
         pfec |= PFEC_implicit;
@@ -1007,23 +1160,15 @@  static int hvmemul_write(
          (vio->mmio_gla == (addr & PAGE_MASK)) )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
 
-    rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
-
-    switch ( rc )
-    {
-    case HVMTRANS_okay:
-        break;
-    case HVMTRANS_bad_linear_to_gfn:
-        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
-        return X86EMUL_EXCEPTION;
-    case HVMTRANS_bad_gfn_to_mfn:
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    if ( IS_ERR(mapping) )
+        return ~PTR_ERR(mapping);
+    else if ( !mapping )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 0);
-    case HVMTRANS_gfn_paged_out:
-    case HVMTRANS_gfn_shared:
-        return X86EMUL_RETRY;
-    default:
-        return X86EMUL_UNHANDLEABLE;
-    }
+
+    memcpy(mapping, p_data, bytes);
+
+    hvmemul_unmap_linear_addr(mapping, addr, bytes, hvmemul_ctxt);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/include/asm-x86/hvm/emulate.h b/xen/include/asm-x86/hvm/emulate.h
index 8864775..65efd4e 100644
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -37,6 +37,13 @@  struct hvm_emulate_ctxt {
     unsigned long seg_reg_accessed;
     unsigned long seg_reg_dirty;
 
+    /*
+     * MFNs behind temporary mappings in the write callback.  The length is
+     * arbitrary, and can be increased if writes longer than PAGE_SIZE are
+     * needed.
+     */
+    mfn_t mfn[2];
+
     uint32_t intr_shadow;
 
     bool_t set_context;