diff mbox series

[v1,4/7] x86/vmx: add do_vmtrace_op

Message ID 34833328.8766172.1592320926648.JavaMail.zimbra@cert.pl (mailing list archive)
State Superseded
Headers show
Series Implement support for external IPT monitoring | expand

Commit Message

Michał Leszczyński June 16, 2020, 3:22 p.m. UTC
Provide an interface for privileged domains to manage
external IPT monitoring.

Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl>
---
 xen/arch/x86/hvm/hvm.c          | 170 ++++++++++++++++++++++++++++++++
 xen/include/public/hvm/hvm_op.h |  27 +++++
 2 files changed, 197 insertions(+)

Comments

Roger Pau Monne June 16, 2020, 5:23 p.m. UTC | #1
On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
> Provide an interface for privileged domains to manage
> external IPT monitoring.
> 
> Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl>

Thanks for the patch! I have some questions below which require your
input.

> ---
>  xen/arch/x86/hvm/hvm.c          | 170 ++++++++++++++++++++++++++++++++
>  xen/include/public/hvm/hvm_op.h |  27 +++++
>  2 files changed, 197 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 5bb47583b3..9292caebe0 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4949,6 +4949,172 @@ static int compat_altp2m_op(
>      return rc;
>  }
>  
> +static int do_vmtrace_op(
> +    XEN_GUEST_HANDLE_PARAM(void) arg)

No need for the newline, this can fit on a single line.

> +{
> +    struct xen_hvm_vmtrace_op a;
> +    struct domain *d = NULL;

I don't think you need to init d to NULL (at least by looking at the
current code below).

> +    int rc = -EFAULT;

No need to init rc.

> +    int i;

unsigned since it's used as a loop counter.

> +    struct vcpu *v;
> +    void* buf;

Nit: '*' should be prepended to the variable name.

> +    uint32_t buf_size;

size_t

> +    uint32_t buf_order;

Order is generally fine using unsigned int, no need to use a
specifically sized type.

> +    uint64_t buf_mfn;

Could this use the mfn type?

> +    struct page_info *pg;
> +
> +    if ( !hvm_ipt_supported() )
> +        return -EOPNOTSUPP;
> +
> +    if ( copy_from_guest(&a, arg, 1) )
> +        return -EFAULT;
> +
> +    if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION )
> +        return -EINVAL;
> +
> +    switch ( a.cmd )
> +    {
> +    case HVMOP_vmtrace_ipt_enable:
> +    case HVMOP_vmtrace_ipt_disable:
> +    case HVMOP_vmtrace_ipt_get_buf:
> +    case HVMOP_vmtrace_ipt_get_offset:
> +        break;
> +
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    d = rcu_lock_domain_by_any_id(a.domain);
> +
> +    if ( d == NULL )
> +        return -ESRCH;
> +
> +    if ( !is_hvm_domain(d) )
> +    {
> +        rc = -EOPNOTSUPP;
> +        goto out;
> +    }
> +
> +    domain_pause(d);
> +
> +    if ( a.vcpu >= d->max_vcpus )
> +    {
> +        rc = -EINVAL;
> +        goto out;
> +    }
> +
> +    v = d->vcpu[a.vcpu];
> +
> +    if ( a.cmd == HVMOP_vmtrace_ipt_enable )

Please use a switch here, you might even consider re-using the switch
from above and moving the domain checks before actually checking the
command field, so that you don't need to perform two switches against
a.cmd.

> +    {
> +        if ( v->arch.hvm.vmx.ipt_state ) {

Coding style, brace should be on newline (there are more below which
I'm not going to comment on).

> +            // already enabled

Comments should use /* ... */, there are multiple instances of this below
which I'm not going to comment on, please check CODING_STYLE.

Also, the interface looks racy, I think you are missing a lock to
protect v->arch.hvm.vmx.ipt_state from being freed under your feet if
you issue concurrent calls to the interface.

> +            rc = -EINVAL;
> +            goto out;
> +        }
> +
> +        if ( a.size < PAGE_SIZE || a.size > 1000000 * PAGE_SIZE ) {

You can use GB(4) which is easier to read. Should the size also be a
multiple of a PAGE_SIZE?

> +            // we don't accept trace buffer size smaller than single page
> +            // and the upper bound is defined as 4GB in the specification
> +            rc = -EINVAL;
> +            goto out;
> +	}

Stray tab.

> +
> +        buf_order = get_order_from_bytes(a.size);
> +
> +        if ( (a.size >> PAGE_SHIFT) != (1 << buf_order) ) {

Oh here is the check. I think you can move this with the checks above
by doing a.size & ~PAGE_MASK.

> +            rc = -EINVAL;
> +            goto out;
> +        }
> +
> +        buf = page_to_virt(alloc_domheap_pages(d, buf_order, MEMF_no_refcount));

What if alloc_domheap_pages return NULL?

Since I think you only want the linear address of the page to zero it
I would suggest using clear_domain_page.

> +        buf_size = a.size;
> +
> +        if ( !buf ) {
> +            rc = -EFAULT;
> +            goto out;
> +        }
> +
> +        memset(buf, 0, buf_size);
> +
> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) {
> +            share_xen_page_with_privileged_guests(virt_to_page(buf) + i, SHARE_ro);

This line (and some more below) exceeds 80 characters, please split
it.

> +        }
> +
> +        v->arch.hvm.vmx.ipt_state = xmalloc(struct ipt_state);

You should check that xmalloc has succeeded before trying to access
ipt_state.

> +        v->arch.hvm.vmx.ipt_state->output_base = virt_to_mfn(buf) << PAGE_SHIFT;
> +        v->arch.hvm.vmx.ipt_state->output_mask = buf_size - 1;
> +        v->arch.hvm.vmx.ipt_state->status = 0;
> +        v->arch.hvm.vmx.ipt_state->ctl = RTIT_CTL_TRACEEN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_BRANCH_EN;

Shouldn't the user be able to select what tracing should be enabled?

> +    }
> +    else if ( a.cmd == HVMOP_vmtrace_ipt_disable )
> +    {
> +        if ( !v->arch.hvm.vmx.ipt_state ) {
> +            rc = -EINVAL;
> +            goto out;
> +        }
> +
> +        buf_mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
> +        buf_size = ( v->arch.hvm.vmx.ipt_state->output_mask + 1 ) & 0xFFFFFFFFUL;
> +
> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
> +        {
> +            if ( (mfn_to_page(_mfn(buf_mfn + i))->count_info & PGC_count_mask) != 1 )
> +            {
> +                rc = -EBUSY;
> +                goto out;
> +            }
> +        }
> +
> +        xfree(v->arch.hvm.vmx.ipt_state);
> +	v->arch.hvm.vmx.ipt_state = NULL;
> +
> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
> +        {
> +            pg = mfn_to_page(_mfn(buf_mfn + i));
> +            put_page_alloc_ref(pg);
> +            if ( !test_and_clear_bit(_PGC_xen_heap, &pg->count_info) )
> +                ASSERT_UNREACHABLE();
> +            pg->u.inuse.type_info = 0;
> +            page_set_owner(pg, NULL);
> +            free_domheap_page(pg);

Hm, this seems fairly dangerous, what guarantees that the caller is
not going to map the buffer while you are trying to tear it down?

You perform a check before freeing ipt_state, but between the check
and the actual tearing down the domain might have setup mappings to
them.

I wonder, could you expand a bit on why trace buffers are allocated
from domheap memory by Xen?

There are a couple of options here, maybe the caller could provide
its own buffer, then Xen would take an extra reference to those pages
and set them up to be used as buffers.

Another alternative would be to use domheap memory but not let the
caller map it directly, and instead introduce a hypercall to copy
from the internal Xen buffer into a user-provided one.

How much memory is used on average by those buffers? That would help
decide a model that would best fit the usage.

> +        }
> +    }
> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_buf )
> +    {
> +        if ( !v->arch.hvm.vmx.ipt_state ) {
> +            rc = -EINVAL;
> +            goto out;
> +        }
> +
> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;

This will not work for translated domains, ie: a PVH or HVM domain
won't be able to use this interface since it has no way to request the
mapping of a specific mfn into it's physmap. I think we need to take
this into account when deciding how the interface should be, so that
we don't corner ourselves with a PV only interface.

> +        a.size = (v->arch.hvm.vmx.ipt_state->output_mask + 1) & 0xFFFFFFFFUL;

You can truncate it easier by casting to uint32_t I think.

Or even better, you could put output_mask in a union like:

union {
    uint64_t raw;
    struct {
        uint32_t size;
	uint32_t offset;
    }
}

Then you can avoid the shifting and the castings.

> +    }
> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_offset )
> +    {
> +        if ( !v->arch.hvm.vmx.ipt_state ) {
> +            rc = -EINVAL;
> +            goto out;
> +        }
> +
> +        a.offset = v->arch.hvm.vmx.ipt_state->output_mask >> 32;
> +    }
> +
> +    rc = -EFAULT;
> +    if ( __copy_to_guest(arg, &a, 1) )
> +      goto out;
> +    rc = 0;
> +
> + out:
> +    smp_wmb();

Why do you need a barrier here?

> +    domain_unpause(d);
> +    rcu_unlock_domain(d);
> +
> +    return rc;
> +}
> +
> +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t);
> +
>  static int hvmop_get_mem_type(
>      XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg)
>  {
> @@ -5101,6 +5267,10 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
>          rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg);
>          break;
>  
> +    case HVMOP_vmtrace:
> +        rc = do_vmtrace_op(arg);
> +        break;
> +
>      default:
>      {
>          gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op);
> diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
> index 870ec52060..3bbcd54c96 100644
> --- a/xen/include/public/hvm/hvm_op.h
> +++ b/xen/include/public/hvm/hvm_op.h
> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>  
> +/* HVMOP_vmtrace: Perform VM tracing related operation */
> +#define HVMOP_vmtrace 26
> +
> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
> +
> +struct xen_hvm_vmtrace_op {
> +    /* IN variable */
> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
> +    uint32_t cmd;
> +/* Enable/disable external vmtrace for given domain */
> +#define HVMOP_vmtrace_ipt_enable      1
> +#define HVMOP_vmtrace_ipt_disable     2
> +#define HVMOP_vmtrace_ipt_get_buf     3
> +#define HVMOP_vmtrace_ipt_get_offset  4
> +    domid_t domain;

You are missing a padding field here AFAICT.

Roger.
Michał Leszczyński June 17, 2020, 7:13 p.m. UTC | #2
----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):

> On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
>> Provide an interface for privileged domains to manage
>> external IPT monitoring.
>> 
>> Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl>
> 
> Thanks for the patch! I have some questions below which require your
> input.
> 
>> ---
>>  xen/arch/x86/hvm/hvm.c          | 170 ++++++++++++++++++++++++++++++++
>>  xen/include/public/hvm/hvm_op.h |  27 +++++
>>  2 files changed, 197 insertions(+)
>> 
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 5bb47583b3..9292caebe0 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -4949,6 +4949,172 @@ static int compat_altp2m_op(
>>      return rc;
>>  }
>>  
>> +static int do_vmtrace_op(
>> +    XEN_GUEST_HANDLE_PARAM(void) arg)
> 
> No need for the newline, this can fit on a single line.
> 
>> +{
>> +    struct xen_hvm_vmtrace_op a;
>> +    struct domain *d = NULL;
> 
> I don't think you need to init d to NULL (at least by looking at the
> current code below).
> 
>> +    int rc = -EFAULT;
> 
> No need to init rc.
> 
>> +    int i;
> 
> unsigned since it's used as a loop counter.
> 
>> +    struct vcpu *v;
>> +    void* buf;
> 
> Nit: '*' should be prepended to the variable name.
> 
>> +    uint32_t buf_size;
> 
> size_t
> 
>> +    uint32_t buf_order;
> 
> Order is generally fine using unsigned int, no need to use a
> specifically sized type.
> 
>> +    uint64_t buf_mfn;
> 
> Could this use the mfn type?
> 
>> +    struct page_info *pg;
>> +
>> +    if ( !hvm_ipt_supported() )
>> +        return -EOPNOTSUPP;
>> +
>> +    if ( copy_from_guest(&a, arg, 1) )
>> +        return -EFAULT;
>> +
>> +    if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION )
>> +        return -EINVAL;
>> +
>> +    switch ( a.cmd )
>> +    {
>> +    case HVMOP_vmtrace_ipt_enable:
>> +    case HVMOP_vmtrace_ipt_disable:
>> +    case HVMOP_vmtrace_ipt_get_buf:
>> +    case HVMOP_vmtrace_ipt_get_offset:
>> +        break;
>> +
>> +    default:
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    d = rcu_lock_domain_by_any_id(a.domain);
>> +
>> +    if ( d == NULL )
>> +        return -ESRCH;
>> +
>> +    if ( !is_hvm_domain(d) )
>> +    {
>> +        rc = -EOPNOTSUPP;
>> +        goto out;
>> +    }
>> +
>> +    domain_pause(d);
>> +
>> +    if ( a.vcpu >= d->max_vcpus )
>> +    {
>> +        rc = -EINVAL;
>> +        goto out;
>> +    }
>> +
>> +    v = d->vcpu[a.vcpu];
>> +
>> +    if ( a.cmd == HVMOP_vmtrace_ipt_enable )
> 
> Please use a switch here, you might even consider re-using the switch
> from above and moving the domain checks before actually checking the
> command field, so that you don't need to perform two switches against
> a.cmd.
> 
>> +    {
>> +        if ( v->arch.hvm.vmx.ipt_state ) {
> 
> Coding style, brace should be on newline (there are more below which
> I'm not going to comment on).
> 
>> +            // already enabled
> 
> Comments should use /* ... */, there are multiple instances of this below
> which I'm not going to comment on, please check CODING_STYLE.
> 
> Also, the interface looks racy, I think you are missing a lock to
> protect v->arch.hvm.vmx.ipt_state from being freed under your feet if
> you issue concurrent calls to the interface.
> 
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        if ( a.size < PAGE_SIZE || a.size > 1000000 * PAGE_SIZE ) {
> 
> You can use GB(4) which is easier to read. Should the size also be a
> multiple of a PAGE_SIZE?
> 
>> +            // we don't accept trace buffer size smaller than single page
>> +            // and the upper bound is defined as 4GB in the specification
>> +            rc = -EINVAL;
>> +            goto out;
>> +	}
> 
> Stray tab.
> 
>> +
>> +        buf_order = get_order_from_bytes(a.size);
>> +
>> +        if ( (a.size >> PAGE_SHIFT) != (1 << buf_order) ) {
> 
> Oh here is the check. I think you can move this with the checks above
> by doing a.size & ~PAGE_MASK.


I believe it's more strict than a.size & ~PAGE_MASK. I think that the CPU expects the buffer size to be a power of 2, so you can have a 64 MB or 128 MB buffer, but not a 96 MB one.


> 
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        buf = page_to_virt(alloc_domheap_pages(d, buf_order,
>> MEMF_no_refcount));
> 
> What if alloc_domheap_pages return NULL?
> 
> Since I think you only want the linear address of the page to zero it
> I would suggest using clear_domain_page.
> 

Hmm. This was fixed already. Most probably I did something strange with git and this change was not stored. I will correct this with patch v2.


>> +        buf_size = a.size;
>> +
>> +        if ( !buf ) {
>> +            rc = -EFAULT;
>> +            goto out;
>> +        }
>> +
>> +        memset(buf, 0, buf_size);
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) {
>> +            share_xen_page_with_privileged_guests(virt_to_page(buf) + i,
>> SHARE_ro);
> 
> This line (and some more below) exceeds 80 characters, please split
> it.
> 
>> +        }
>> +
>> +        v->arch.hvm.vmx.ipt_state = xmalloc(struct ipt_state);
> 
> You should check that xmalloc has succeeded before trying to access
> ipt_state.
> 
>> +        v->arch.hvm.vmx.ipt_state->output_base = virt_to_mfn(buf) <<
>> PAGE_SHIFT;
>> +        v->arch.hvm.vmx.ipt_state->output_mask = buf_size - 1;
>> +        v->arch.hvm.vmx.ipt_state->status = 0;
>> +        v->arch.hvm.vmx.ipt_state->ctl = RTIT_CTL_TRACEEN | RTIT_CTL_OS |
>> RTIT_CTL_USR | RTIT_CTL_BRANCH_EN;
> 
> Shouldn't the user be able to select what tracing should be enabled?
> 
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_disable )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        buf_mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
>> +        buf_size = ( v->arch.hvm.vmx.ipt_state->output_mask + 1 ) &
>> 0xFFFFFFFFUL;
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
>> +        {
>> +            if ( (mfn_to_page(_mfn(buf_mfn + i))->count_info & PGC_count_mask)
>> != 1 )
>> +            {
>> +                rc = -EBUSY;
>> +                goto out;
>> +            }
>> +        }
>> +
>> +        xfree(v->arch.hvm.vmx.ipt_state);
>> +	v->arch.hvm.vmx.ipt_state = NULL;
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
>> +        {
>> +            pg = mfn_to_page(_mfn(buf_mfn + i));
>> +            put_page_alloc_ref(pg);
>> +            if ( !test_and_clear_bit(_PGC_xen_heap, &pg->count_info) )
>> +                ASSERT_UNREACHABLE();
>> +            pg->u.inuse.type_info = 0;
>> +            page_set_owner(pg, NULL);
>> +            free_domheap_page(pg);
> 
> Hm, this seems fairly dangerous, what guarantees that the caller is
> not going to map the buffer while you are trying to tear it down?
> 
> You perform a check before freeing ipt_state, but between the check
> and the actual tearing down the domain might have setup mappings to
> them.
> 
> I wonder, could you expand a bit on why trace buffers are allocated
> from domheap memory by Xen?


In general, I thought it would be good to account trace buffers for particular DomUs, so it would be easier to troubleshoot the memory usage.


> 
> There are a couple of options here, maybe the caller could provide
> its own buffer, then Xen would take an extra reference to those pages
> and set them up to be used as buffers.
> 
> Another alternative would be to use domheap memory but not let the
> caller map it directly, and instead introduce a hypercall to copy
> from the internal Xen buffer into a user-provided one.
> 
> How much memory is used on average by those buffers? That would help
> decide a model that would best fit the usage.


From 4 kB to 4 GB. Right now I use 128 MB buffers and it takes just a few seconds to fill them up completely.

I think I've just copied the pattern which is already present in Xen's code, e.g. interfaces used by xenbaked/xentrace tools.


> 
>> +        }
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_buf )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
> 
> This will not work for translated domains, ie: a PVH or HVM domain
> won't be able to use this interface since it has no way to request the
> mapping of a specific mfn into it's physmap. I think we need to take
> this into account when deciding how the interface should be, so that
> we don't corner ourselves with a PV only interface.

Please be aware that this is only going to be used by Dom0. Is it a well-supported case that somebody is using a PVH/HVM Dom0?

I think that all Virtual Machine Introspection stuff currently requires to have Dom0 PV. Our main goal is to have this working well in combo with VMI.


> 
>> +        a.size = (v->arch.hvm.vmx.ipt_state->output_mask + 1) & 0xFFFFFFFFUL;
> 
> You can truncate it easier by casting to uint32_t I think.
> 
> Or even better, you could put output_mask in a union like:
> 
> union {
>    uint64_t raw;
>    struct {
>        uint32_t size;
>	uint32_t offset;
>    }
> }
> 
> Then you can avoid the shifting and the castings.
> 
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_offset )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        a.offset = v->arch.hvm.vmx.ipt_state->output_mask >> 32;
>> +    }
>> +
>> +    rc = -EFAULT;
>> +    if ( __copy_to_guest(arg, &a, 1) )
>> +      goto out;
>> +    rc = 0;
>> +
>> + out:
>> +    smp_wmb();
> 
> Why do you need a barrier here?
> 
>> +    domain_unpause(d);
>> +    rcu_unlock_domain(d);
>> +
>> +    return rc;
>> +}
>> +
>> +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t);
>> +
>>  static int hvmop_get_mem_type(
>>      XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg)
>>  {
>> @@ -5101,6 +5267,10 @@ long do_hvm_op(unsigned long op,
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>          rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg);
>>          break;
>>  
>> +    case HVMOP_vmtrace:
>> +        rc = do_vmtrace_op(arg);
>> +        break;
>> +
>>      default:
>>      {
>>          gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op);
>> diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
>> index 870ec52060..3bbcd54c96 100644
>> --- a/xen/include/public/hvm/hvm_op.h
>> +++ b/xen/include/public/hvm/hvm_op.h
>> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>  
>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
>> +#define HVMOP_vmtrace 26
>> +
>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
>> +
>> +struct xen_hvm_vmtrace_op {
>> +    /* IN variable */
>> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
>> +    uint32_t cmd;
>> +/* Enable/disable external vmtrace for given domain */
>> +#define HVMOP_vmtrace_ipt_enable      1
>> +#define HVMOP_vmtrace_ipt_disable     2
>> +#define HVMOP_vmtrace_ipt_get_buf     3
>> +#define HVMOP_vmtrace_ipt_get_offset  4
>> +    domid_t domain;
> 
> You are missing a padding field here AFAICT.
> 
> Roger.


Thanks for your feedback, I will apply all the remaining suggestions in patch v2.

Best regards,
Michał Leszczyński
CERT Polska
Tamas K Lengyel June 18, 2020, 3:20 a.m. UTC | #3
> >> +
> >> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
> >
> > This will not work for translated domains, ie: a PVH or HVM domain
> > won't be able to use this interface since it has no way to request the
> > mapping of a specific mfn into it's physmap. I think we need to take
> > this into account when deciding how the interface should be, so that
> > we don't corner ourselves with a PV only interface.
>
> Please be aware that this is only going to be used by Dom0. Is it a well-supported case that somebody is using a PVH/HVM Dom0?
>
> I think that all Virtual Machine Introspection stuff currently requires to have Dom0 PV. Our main goal is to have this working well in combo with VMI.

FYI the VMI interface doesn't require a PV domain. It works fine from
PVH dom0 or even from a secondary privileged HVM DomU as well,
provided you have the right XSM policy to allow that.

Tamas
Roger Pau Monne June 18, 2020, 8:46 a.m. UTC | #4
On Wed, Jun 17, 2020 at 09:13:05PM +0200, Michał Leszczyński wrote:
> ----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):
> 
> > On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
> >> +        buf_order = get_order_from_bytes(a.size);
> >> +
> >> +        if ( (a.size >> PAGE_SHIFT) != (1 << buf_order) ) {
> > 
> > Oh here is the check. I think you can move this with the checks above
> > by doing a.size & ~PAGE_MASK.
> 
> 
> I belive it's more strict than a.size & ~PAGE_MASK. I think that CPU expects that the buffer size is a power of 2, so you can have 64 MB or 128 MB, but not 96 MB buffer.

Oh, sorry, didn't realize. I think it's clearer to check if a size is
not a power of two by doing (size & (size - 1)). This could be joined
with the previous checks.

> > There are a couple of options here, maybe the caller could provide
> > it's own buffer, then Xen would take an extra reference to those pages
> > and setup them to be used as buffers.
> > 
> > Another alternative would be to use domhep memory but not let the
> > caller map it directly, and instead introduce a hypercall to copy
> > from the internal Xen buffer into a user-provided one.
> > 
> > How much memory is used on average by those buffers? That would help
> > decide a model that would best fit the usage.
> 
> 
> From 4 kB to 4 GB. Right now I use 128 MB buffers and it takes just a few seconds to fill them up completely.
> 
> I think I've just copied the pattern which is already present in Xen's code, e.g. interfaces used by xenbaked/xentrace tools.

I think using XENMEM_acquire_resource will result in cleaner code
overall, it would also avoid having to share the pages with Xen
AFAICT. It's also more inline with how new interfaces deal with this
kind of memory sharing.

Roger.
Michał Leszczyński June 18, 2020, 11:01 a.m. UTC | #5
----- 18 cze 2020 o 5:20, Tamas K Lengyel tamas.k.lengyel@gmail.com napisał(a):

>> >> +
>> >> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
>> >
>> > This will not work for translated domains, ie: a PVH or HVM domain
>> > won't be able to use this interface since it has no way to request the
>> > mapping of a specific mfn into it's physmap. I think we need to take
>> > this into account when deciding how the interface should be, so that
>> > we don't corner ourselves with a PV only interface.
>>
>> Please be aware that this is only going to be used by Dom0. Is it a well-supported
>> case that somebody is using a PVH/HVM Dom0?
>>
>> I think that all Virtual Machine Introspection stuff currently requires to have
>> Dom0 PV. Our main goal is to have this working well in combo with VMI.
> 
> FYI the VMI interface doesn't require a PV domain. It works fine from
> PVH dom0 or even from a secondary privileged HVM DomU as well,
> provided you have the right XSM policy to allow that.
> 
> Tamas


It was previously stated that:

> PVH or HVM domain
> won't be able to use this interface since it has no way to request the
> mapping of a specific mfn into it's physmap.

but however, taking LibVMI as an example:

https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51

An essential abstraction xen_get_memory() relies on xc_map_foreign_range(). Doesn't this mean that it's not usable from PVH or HVM domains, or did I got it all wrong?


Best regards,
Michał Leszczyński
CERT Polska
Roger Pau Monne June 18, 2020, 11:55 a.m. UTC | #6
On Thu, Jun 18, 2020 at 01:01:39PM +0200, Michał Leszczyński wrote:
> ----- 18 cze 2020 o 5:20, Tamas K Lengyel tamas.k.lengyel@gmail.com napisał(a):
> 
> >> >> +
> >> >> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
> >> >
> >> > This will not work for translated domains, ie: a PVH or HVM domain
> >> > won't be able to use this interface since it has no way to request the
> >> > mapping of a specific mfn into it's physmap. I think we need to take
> >> > this into account when deciding how the interface should be, so that
> >> > we don't corner ourselves with a PV only interface.
> >>
> >> Please be aware that this is only going to be used by Dom0. Is is well-supported
> >> case that somebody is using PVH/HVM Dom0?
> >>
> >> I think that all Virtual Machine Introspection stuff currently requires to have
> >> Dom0 PV. Our main goal is to have this working well in combo with VMI.
> > 
> > FYI the VMI interface doesn't require a PV domain. It works fine from
> > PVH dom0 or even from a secondary privileged HVM DomU as well,
> > provided you have the right XSM policy to allow that.
> > 
> > Tamas
> 
> 
> It was previously stated that:
> 
> > PVH or HVM domain
> > won't be able to use this interface since it has no way to request the
> > mapping of a specific mfn into it's physmap.
> 
> but however, taking LibVMI as an example:
> 
> https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51
> 
> An essential abstraction xen_get_memory() relies on xc_map_foreign_range(). Doesn't this mean that it's not usable from PVH or HVM domains, or did I got it all wrong?

That was my fault, so the buffer mfns are assigned to Xen, and then
the Xen domain ID is used to map those, which should work on both PV
and HVM (or PVH).

I still think using XENMEM_acquire_resource might be better, but I
would let others comment.

Roger.
Jan Beulich June 18, 2020, 12:51 p.m. UTC | #7
On 18.06.2020 13:55, Roger Pau Monné wrote:
> On Thu, Jun 18, 2020 at 01:01:39PM +0200, Michał Leszczyński wrote:
>> It was previously stated that:
>>
>>> PVH or HVM domain
>>> won't be able to use this interface since it has no way to request the
>>> mapping of a specific mfn into it's physmap.
>>
>> but however, taking LibVMI as an example:
>>
>> https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51
>>
>> An essential abstraction xen_get_memory() relies on xc_map_foreign_range(). Doesn't this mean that it's not usable from PVH or HVM domains, or did I got it all wrong?
> 
> That was my fault, so the buffer mfns are assigned to Xen, and then
> the Xen domain ID is used to map those, which should work on both PV
> and HVM (or PVH).
> 
> I still think using XENMEM_acquire_resource might be better, but I
> would let others comment.

+1

Jan
Michał Leszczyński June 18, 2020, 1:09 p.m. UTC | #8
----- 18 cze 2020 o 14:51, Jan Beulich jbeulich@suse.com napisał(a):

> On 18.06.2020 13:55, Roger Pau Monné wrote:
>> On Thu, Jun 18, 2020 at 01:01:39PM +0200, Michał Leszczyński wrote:
>>> It was previously stated that:
>>>
>>>> PVH or HVM domain
>>>> won't be able to use this interface since it has no way to request the
>>>> mapping of a specific mfn into it's physmap.
>>>
>>> but however, taking LibVMI as an example:
>>>
>>> https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51
>>>
>>> An essential abstraction xen_get_memory() relies on xc_map_foreign_range().
>>> Doesn't this mean that it's not usable from PVH or HVM domains, or did I got it
>>> all wrong?
>> 
>> That was my fault, so the buffer mfns are assigned to Xen, and then
>> the Xen domain ID is used to map those, which should work on both PV
>> and HVM (or PVH).
>> 
>> I still think using XENMEM_acquire_resource might be better, but I
>> would let others comment.
> 
> +1
> 
> Jan


I'm trying to implement this right now. I've added some very simple code to mm.c just for testing:

---

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e376fc7e8f..aaaefe6d23 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4624,6 +4624,26 @@ int arch_acquire_resource(struct domain *d, unsigned int type,
         }
         break;
     }
+
+    case XENMEM_resource_vmtrace_buf:
+    {
+        uint64_t output_base;
+        mfn_t mfn;
+        unsigned int i;
+
+        printk("vmtrace buf acquire\n");
+        output_base = d->vcpu[id]->arch.hvm.vmx.ipt_state->output_base;
+        mfn = mfn_x(output_base >> PAGE_SHIFT);
+
+        rc = 0;
+        for ( i = 0; i < nr_frames; i++ )
+        {
+            __map_domain_page_global(mfn_to_page(mfn + i));
+            mfn_list[i] = mfn + i;
+        }
+
+        break;
+    }
 #endif

     default:

---


and then in my "proctrace" tool I'm trying to acquire it like this:

    fres = xenforeignmemory_map_resource(
        fmem, domid, XENMEM_resource_vmtrace_buf,
        /* vcpu: */ 0, /* frame: */ 0, /* num_frames: */ 128, (void **)&buf,
        PROT_READ, 0);


ioctl fails with "Argument list too long". It works fine when I provide some small number of frames (e.g. num_frames: 1 or 32), but doesn't work for any larger quantity.

How should I proceed with this? The PT buffer could be large, even up to 4 GB.


Best regards,
Michał Leszczyński
CERT Polska
Jan Beulich June 18, 2020, 1:24 p.m. UTC | #9
On 18.06.2020 15:09, Michał Leszczyński wrote:
> ----- 18 cze 2020 o 14:51, Jan Beulich jbeulich@suse.com napisał(a):
> 
>> On 18.06.2020 13:55, Roger Pau Monné wrote:
>>> On Thu, Jun 18, 2020 at 01:01:39PM +0200, Michał Leszczyński wrote:
>>>> It was previously stated that:
>>>>
>>>>> PVH or HVM domain
>>>>> won't be able to use this interface since it has no way to request the
>>>>> mapping of a specific mfn into it's physmap.
>>>>
>>>> however, taking LibVMI as an example:
>>>>
>>>> https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51
>>>>
>>>> An essential abstraction xen_get_memory() relies on xc_map_foreign_range().
>>>> Doesn't this mean that it's not usable from PVH or HVM domains, or did I get it
>>>> all wrong?
>>>
>>> That was my fault, so the buffer mfns are assigned to Xen, and then
>>> the Xen domain ID is used to map those, which should work on both PV
>>> and HVM (or PVH).
>>>
>>> I still think using XENMEM_acquire_resource might be better, but I
>>> would let others comment.
>>
>> +1
>>
>> Jan
> 
> 
> I'm trying to implement this right now. I've added some very simple code to mm.c just for testing:
> 
> ---
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index e376fc7e8f..aaaefe6d23 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4624,6 +4624,26 @@ int arch_acquire_resource(struct domain *d, unsigned int type,
>          }
>          break;
>      }
> +
> +    case XENMEM_resource_vmtrace_buf:
> +    {
> +        uint64_t output_base;
> +        mfn_t mfn;
> +        unsigned int i;
> +
> +        printk("vmtrace buf acquire\n");
> +        output_base = d->vcpu[id]->arch.hvm.vmx.ipt_state->output_base;
> +        mfn = mfn_x(output_base >> PAGE_SHIFT);
> +
> +        rc = 0;
> +        for ( i = 0; i < nr_frames; i++ )
> +        {
> +            __map_domain_page_global(mfn_to_page(mfn + i));
> +            mfn_list[i] = mfn + i;
> +        }
> +
> +        break;
> +    }
>  #endif
> 
>      default:
> 
> ---
> 
> 
> and then in my "proctrace" tool I'm trying to acquire it like this:
> 
>     fres = xenforeignmemory_map_resource(
>         fmem, domid, XENMEM_resource_vmtrace_buf,
>         /* vcpu: */ 0, /* frame: */ 0, /* num_frames: */ 128, (void **)&buf,
>         PROT_READ, 0);
> 
> 
> ioctl fails with "Argument list too long". It works fine when I provide some small number of frames (e.g. num_frames: 1 or 32), but doesn't work for any larger quantity.

See xen/common/memory.c:acquire_resource(). So far larger quantities
weren't needed, so there's an implementation limit of 32 right now.
This can be lifted, but please not by growing the on-stack array (as
the comment there also suggests).

Jan
Roger Pau Monne June 18, 2020, 1:40 p.m. UTC | #10
On Thu, Jun 18, 2020 at 03:09:57PM +0200, Michał Leszczyński wrote:
> ----- 18 cze 2020 o 14:51, Jan Beulich jbeulich@suse.com napisał(a):
> 
> > On 18.06.2020 13:55, Roger Pau Monné wrote:
> >> On Thu, Jun 18, 2020 at 01:01:39PM +0200, Michał Leszczyński wrote:
> >>> It was previously stated that:
> >>>
> >>>> PVH or HVM domain
> >>>> won't be able to use this interface since it has no way to request the
> >>>> mapping of a specific mfn into its physmap.
> >>>
> >>> however, taking LibVMI as an example:
> >>>
> >>> https://github.com/libvmi/libvmi/blob/c461e20ae88bc62c08c27f50fcead23c27a30f9e/libvmi/driver/xen/xen.c#L51
> >>>
> >>> An essential abstraction xen_get_memory() relies on xc_map_foreign_range().
> >>> Doesn't this mean that it's not usable from PVH or HVM domains, or did I get it
> >>> all wrong?
> >> 
> >> That was my fault, so the buffer mfns are assigned to Xen, and then
> >> the Xen domain ID is used to map those, which should work on both PV
> >> and HVM (or PVH).
> >> 
> >> I still think using XENMEM_acquire_resource might be better, but I
> >> would let others comment.
> > 
> > +1
> > 
> > Jan
> 
> 
> I'm trying to implement this right now. I've added some very simple code to mm.c just for testing:
> 
> ---
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index e376fc7e8f..aaaefe6d23 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4624,6 +4624,26 @@ int arch_acquire_resource(struct domain *d, unsigned int type,
>          }
>          break;
>      }
> +
> +    case XENMEM_resource_vmtrace_buf:
> +    {
> +        uint64_t output_base;
> +        mfn_t mfn;
> +        unsigned int i;
> +
> +        printk("vmtrace buf acquire\n");
> +        output_base = d->vcpu[id]->arch.hvm.vmx.ipt_state->output_base;
> +        mfn = mfn_x(output_base >> PAGE_SHIFT);
> +
> +        rc = 0;
> +        for ( i = 0; i < nr_frames; i++ )
> +        {
> +            __map_domain_page_global(mfn_to_page(mfn + i));

I don't think you need the __map_domain_page_global?

> +            mfn_list[i] = mfn + i;

I think you need mfn_add here, or else this won't build?

> +        }
> +
> +        break;
> +    }
>  #endif
> 
>      default:
> 
> ---
> 
> 
> and then in my "proctrace" tool I'm trying to acquire it like this:
> 
>     fres = xenforeignmemory_map_resource(
>         fmem, domid, XENMEM_resource_vmtrace_buf,
>         /* vcpu: */ 0, /* frame: */ 0, /* num_frames: */ 128, (void **)&buf,
>         PROT_READ, 0);
> 
> 
> ioctl fails with "Argument list too long". It works fine when I provide some small number of frames (e.g. num_frames: 1 or 32), but doesn't work for any larger quantity.
> 
> How should I proceed with this? The PT buffer could be large, even up to 4 GB.

I think adding a loop and hypercall continuation support could make
this work without having to expand the size of mfn_list and
gfn_list?

Thanks, Roger.
Michał Leszczyński June 18, 2020, 3:25 p.m. UTC | #11
----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):

> On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
>> Provide an interface for privileged domains to manage
>> external IPT monitoring.
>> 
>> Signed-off-by: Michal Leszczynski <michal.leszczynski@cert.pl>
> 
> Thanks for the patch! I have some questions below which require your
> input.
> 
>> ---
>>  xen/arch/x86/hvm/hvm.c          | 170 ++++++++++++++++++++++++++++++++
>>  xen/include/public/hvm/hvm_op.h |  27 +++++
>>  2 files changed, 197 insertions(+)
>> 
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 5bb47583b3..9292caebe0 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -4949,6 +4949,172 @@ static int compat_altp2m_op(
>>      return rc;
>>  }
>>  
>> +static int do_vmtrace_op(
>> +    XEN_GUEST_HANDLE_PARAM(void) arg)
> 
> No need for the newline, this can fit on a single line.
> 
>> +{
>> +    struct xen_hvm_vmtrace_op a;
>> +    struct domain *d = NULL;
> 
> I don't think you need to init d to NULL (at least by looking at the
> current code below).
> 
>> +    int rc = -EFAULT;
> 
> No need to init rc.
> 
>> +    int i;
> 
> unsigned since it's used as a loop counter.
> 
>> +    struct vcpu *v;
>> +    void* buf;
> 
> Nit: '*' should be prepended to the variable name.
> 
>> +    uint32_t buf_size;
> 
> size_t
> 
>> +    uint32_t buf_order;
> 
> Order is generally fine using unsigned int, no need to use a
> specifically sized type.
> 
>> +    uint64_t buf_mfn;
> 
> Could this use the mfn type?
> 
>> +    struct page_info *pg;
>> +
>> +    if ( !hvm_ipt_supported() )
>> +        return -EOPNOTSUPP;
>> +
>> +    if ( copy_from_guest(&a, arg, 1) )
>> +        return -EFAULT;
>> +
>> +    if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION )
>> +        return -EINVAL;
>> +
>> +    switch ( a.cmd )
>> +    {
>> +    case HVMOP_vmtrace_ipt_enable:
>> +    case HVMOP_vmtrace_ipt_disable:
>> +    case HVMOP_vmtrace_ipt_get_buf:
>> +    case HVMOP_vmtrace_ipt_get_offset:
>> +        break;
>> +
>> +    default:
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    d = rcu_lock_domain_by_any_id(a.domain);
>> +
>> +    if ( d == NULL )
>> +        return -ESRCH;
>> +
>> +    if ( !is_hvm_domain(d) )
>> +    {
>> +        rc = -EOPNOTSUPP;
>> +        goto out;
>> +    }
>> +
>> +    domain_pause(d);
>> +
>> +    if ( a.vcpu >= d->max_vcpus )
>> +    {
>> +        rc = -EINVAL;
>> +        goto out;
>> +    }
>> +
>> +    v = d->vcpu[a.vcpu];
>> +
>> +    if ( a.cmd == HVMOP_vmtrace_ipt_enable )
> 
> Please use a switch here, you might even consider re-using the switch
> from above and moving the domain checks before actually checking the
> command field, so that you don't need to perform two switches against
> a.cmd.
> 
>> +    {
>> +        if ( v->arch.hvm.vmx.ipt_state ) {
> 
> Coding style, brace should be on newline (there are more below which
> I'm not going to comment on).
> 
>> +            // already enabled
> 
> Comments should use /* ... */, there multiple instances of this below
> which I'm not going to comment on, please check CODING_STYLE.
> 
> Also, the interface looks racy, I think you are missing a lock to
> protect v->arch.hvm.vmx.ipt_state from being freed under your feet if
> you issue concurrent calls to the interface.
> 
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        if ( a.size < PAGE_SIZE || a.size > 1000000 * PAGE_SIZE ) {
> 
> You can use GB(4) which is easier to read. Should the size also be a
> multiple of a PAGE_SIZE?
> 
>> +            // we don't accept trace buffer size smaller than single page
>> +            // and the upper bound is defined as 4GB in the specification
>> +            rc = -EINVAL;
>> +            goto out;
>> +	}
> 
> Stray tab.
> 
>> +
>> +        buf_order = get_order_from_bytes(a.size);
>> +
>> +        if ( (a.size >> PAGE_SHIFT) != (1 << buf_order) ) {
> 
> Oh here is the check. I think you can move this with the checks above
> by doing a.size & ~PAGE_MASK.
> 
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        buf = page_to_virt(alloc_domheap_pages(d, buf_order,
>> MEMF_no_refcount));
> 
> What if alloc_domheap_pages return NULL?
> 
> Since I think you only what the linear address of the page to zero it
> I would suggest using clear_domain_page.
> 
>> +        buf_size = a.size;
>> +
>> +        if ( !buf ) {
>> +            rc = -EFAULT;
>> +            goto out;
>> +        }
>> +
>> +        memset(buf, 0, buf_size);
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) {
>> +            share_xen_page_with_privileged_guests(virt_to_page(buf) + i,
>> SHARE_ro);
> 
> This line (and some more below) exceeds 80 characters, please split
> it.
> 
>> +        }
>> +
>> +        v->arch.hvm.vmx.ipt_state = xmalloc(struct ipt_state);
> 
> You should check that xmalloc has succeeds before trying to access
> ipt_state.
> 
>> +        v->arch.hvm.vmx.ipt_state->output_base = virt_to_mfn(buf) <<
>> PAGE_SHIFT;
>> +        v->arch.hvm.vmx.ipt_state->output_mask = buf_size - 1;
>> +        v->arch.hvm.vmx.ipt_state->status = 0;
>> +        v->arch.hvm.vmx.ipt_state->ctl = RTIT_CTL_TRACEEN | RTIT_CTL_OS |
>> RTIT_CTL_USR | RTIT_CTL_BRANCH_EN;
> 
> Shouldn't the user be able to select what tracing should be enabled?
> 
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_disable )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        buf_mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
>> +        buf_size = ( v->arch.hvm.vmx.ipt_state->output_mask + 1 ) &
>> 0xFFFFFFFFUL;
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
>> +        {
>> +            if ( (mfn_to_page(_mfn(buf_mfn + i))->count_info & PGC_count_mask)
>> != 1 )
>> +            {
>> +                rc = -EBUSY;
>> +                goto out;
>> +            }
>> +        }
>> +
>> +        xfree(v->arch.hvm.vmx.ipt_state);
>> +	v->arch.hvm.vmx.ipt_state = NULL;
>> +
>> +        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
>> +        {
>> +            pg = mfn_to_page(_mfn(buf_mfn + i));
>> +            put_page_alloc_ref(pg);
>> +            if ( !test_and_clear_bit(_PGC_xen_heap, &pg->count_info) )
>> +                ASSERT_UNREACHABLE();
>> +            pg->u.inuse.type_info = 0;
>> +            page_set_owner(pg, NULL);
>> +            free_domheap_page(pg);
> 
> Hm, this seems fairly dangerous, what guarantees that the caller is
> not going to map the buffer while you are trying to tear it down?
> 
> You perform a check before freeing ipt_state, but between the check
> and the actual tearing down the domain might have setup mappings to
> them.
> 
> I wonder, could you expand a bit on why trace buffers are allocated
> from domheap memory by Xen?
> 
> There are a couple of options here, maybe the caller could provide
> it's own buffer, then Xen would take an extra reference to those pages
> and setup them to be used as buffers.
> 
> Another alternative would be to use domhep memory but not let the
> caller map it directly, and instead introduce a hypercall to copy
> from the internal Xen buffer into a user-provided one.
> 
> How much memory is used on average by those buffers? That would help
> decide a model that would best fit the usage.
> 
>> +        }
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_buf )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
> 
> This will not work for translated domains, ie: a PVH or HVM domain
> won't be able to use this interface since it has no way to request the
> mapping of a specific mfn into its physmap. I think we need to take
> this into account when deciding how the interface should be, so that
> we don't corner ourselves with a PV only interface.
> 
>> +        a.size = (v->arch.hvm.vmx.ipt_state->output_mask + 1) & 0xFFFFFFFFUL;
> 
> You can truncate it easier by casting to uint32_t I think.
> 
> Or even better, you could put output_mask in a union like:
> 
> union {
>    uint64_t raw;
>    struct {
>        uint32_t size;
>	uint32_t offset;
>    }
> }
> 
> Then you can avoid the shifting and the castings.
> 
>> +    }
>> +    else if ( a.cmd == HVMOP_vmtrace_ipt_get_offset )
>> +    {
>> +        if ( !v->arch.hvm.vmx.ipt_state ) {
>> +            rc = -EINVAL;
>> +            goto out;
>> +        }
>> +
>> +        a.offset = v->arch.hvm.vmx.ipt_state->output_mask >> 32;
>> +    }
>> +
>> +    rc = -EFAULT;
>> +    if ( __copy_to_guest(arg, &a, 1) )
>> +      goto out;
>> +    rc = 0;
>> +
>> + out:
>> +    smp_wmb();
> 
> Why do you need a barrier here?
> 
>> +    domain_unpause(d);
>> +    rcu_unlock_domain(d);
>> +
>> +    return rc;
>> +}
>> +
>> +DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t);
>> +
>>  static int hvmop_get_mem_type(
>>      XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg)
>>  {
>> @@ -5101,6 +5267,10 @@ long do_hvm_op(unsigned long op,
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>          rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg);
>>          break;
>>  
>> +    case HVMOP_vmtrace:
>> +        rc = do_vmtrace_op(arg);
>> +        break;
>> +
>>      default:
>>      {
>>          gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op);
>> diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
>> index 870ec52060..3bbcd54c96 100644
>> --- a/xen/include/public/hvm/hvm_op.h
>> +++ b/xen/include/public/hvm/hvm_op.h
>> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>  
>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
>> +#define HVMOP_vmtrace 26
>> +
>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
>> +
>> +struct xen_hvm_vmtrace_op {
>> +    /* IN variable */
>> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
>> +    uint32_t cmd;
>> +/* Enable/disable external vmtrace for given domain */
>> +#define HVMOP_vmtrace_ipt_enable      1
>> +#define HVMOP_vmtrace_ipt_disable     2
>> +#define HVMOP_vmtrace_ipt_get_buf     3
>> +#define HVMOP_vmtrace_ipt_get_offset  4
>> +    domid_t domain;
> 
> You are missing a padding field here AFAICT.


Could you point out what is the purpose of this padding field and what should be the size in this case? It's pretty unclear to me.


> 
> Roger.
Jan Beulich June 18, 2020, 3:39 p.m. UTC | #12
On 18.06.2020 17:25, Michał Leszczyński wrote:
> ----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):
>> On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
>>> --- a/xen/include/public/hvm/hvm_op.h
>>> +++ b/xen/include/public/hvm/hvm_op.h
>>> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
>>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>>  
>>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
>>> +#define HVMOP_vmtrace 26
>>> +
>>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
>>> +
>>> +struct xen_hvm_vmtrace_op {
>>> +    /* IN variable */
>>> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
>>> +    uint32_t cmd;
>>> +/* Enable/disable external vmtrace for given domain */
>>> +#define HVMOP_vmtrace_ipt_enable      1
>>> +#define HVMOP_vmtrace_ipt_disable     2
>>> +#define HVMOP_vmtrace_ipt_get_buf     3
>>> +#define HVMOP_vmtrace_ipt_get_offset  4
>>> +    domid_t domain;
>>
>> You are missing a padding field here AFAICT.
> 
> 
> Could you point out what is the purpose of this padding field and what should be the size in this case? It's pretty unclear to me.

In the public interface we aim at making all padding explicit.

Also, as a general remark: Please trim your replies.

Jan
Tamas K Lengyel June 18, 2020, 3:47 p.m. UTC | #13
On Thu, Jun 18, 2020 at 9:41 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 18.06.2020 17:25, Michał Leszczyński wrote:
> > ----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):
> >> On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
> >>> --- a/xen/include/public/hvm/hvm_op.h
> >>> +++ b/xen/include/public/hvm/hvm_op.h
> >>> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
> >>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
> >>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
> >>>
> >>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
> >>> +#define HVMOP_vmtrace 26
> >>> +
> >>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
> >>> +
> >>> +struct xen_hvm_vmtrace_op {
> >>> +    /* IN variable */
> >>> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
> >>> +    uint32_t cmd;
> >>> +/* Enable/disable external vmtrace for given domain */
> >>> +#define HVMOP_vmtrace_ipt_enable      1
> >>> +#define HVMOP_vmtrace_ipt_disable     2
> >>> +#define HVMOP_vmtrace_ipt_get_buf     3
> >>> +#define HVMOP_vmtrace_ipt_get_offset  4
> >>> +    domid_t domain;
> >>
> >> You are missing a padding field here AFAICT.
> >
> >
> > Could you point out what is the purpose of this padding field and what should be the size in this case? It's pretty unclear to me.
>
> In the public interface we aim at making all padding explicit.

Just to expand a bit on this: this is an ABI meaning the hypervisor
and the tool sending this structure communicate via memory only. Since
the hypervisor and the compiler can be compiled using different
compilers, using stuff that's not explicit in the C standard needs to
be avoided. For example using standard types like "int" or "long" is
no good since it's really up to the compiler to decide how much memory
those need. The C specification is great like that.. Same stands for
padding. Different compilers can decide to align things differently,
pack things or not pack things, so we have to manually add the padding
to take that choice away from the compiler.

As discussed many times on the list, using C struct as the base of the
ABI was a bad design decision to start with, but we are now stuck with
it. It would now make more sense to use something like JSON to pass
information like this between the hypervisor and the toolstack which
is what we opted to do in new hypervisors like Bareflank/Boxy.

Tamas
Tamas K Lengyel June 18, 2020, 3:49 p.m. UTC | #14
On Thu, Jun 18, 2020 at 9:47 AM Tamas K Lengyel
<tamas.k.lengyel@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 9:41 AM Jan Beulich <jbeulich@suse.com> wrote:
> >
> > On 18.06.2020 17:25, Michał Leszczyński wrote:
> > > ----- 16 cze 2020 o 19:23, Roger Pau Monné roger.pau@citrix.com napisał(a):
> > >> On Tue, Jun 16, 2020 at 05:22:06PM +0200, Michał Leszczyński wrote:
> > >>> --- a/xen/include/public/hvm/hvm_op.h
> > >>> +++ b/xen/include/public/hvm/hvm_op.h
> > >>> @@ -382,6 +382,33 @@ struct xen_hvm_altp2m_op {
> > >>>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
> > >>>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
> > >>>
> > >>> +/* HVMOP_vmtrace: Perform VM tracing related operation */
> > >>> +#define HVMOP_vmtrace 26
> > >>> +
> > >>> +#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
> > >>> +
> > >>> +struct xen_hvm_vmtrace_op {
> > >>> +    /* IN variable */
> > >>> +    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
> > >>> +    uint32_t cmd;
> > >>> +/* Enable/disable external vmtrace for given domain */
> > >>> +#define HVMOP_vmtrace_ipt_enable      1
> > >>> +#define HVMOP_vmtrace_ipt_disable     2
> > >>> +#define HVMOP_vmtrace_ipt_get_buf     3
> > >>> +#define HVMOP_vmtrace_ipt_get_offset  4
> > >>> +    domid_t domain;
> > >>
> > >> You are missing a padding field here AFAICT.
> > >
> > >
> > > Could you point out what is the purpose of this padding field and what should be the size in this case? It's pretty unclear to me.
> >
> > In the public interface we aim at making all padding explicit.
>
> Just to expand a bit on this: this is an ABI meaning the hypervisor
> and the tool sending this structure communicate via memory only. Since
> the hypervisor and the compiler can be compiled using different

   ^ meant to write "hypervisor and the toolstack" above

Tamas
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 5bb47583b3..9292caebe0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4949,6 +4949,172 @@  static int compat_altp2m_op(
     return rc;
 }
 
+static int do_vmtrace_op(
+    XEN_GUEST_HANDLE_PARAM(void) arg)
+{
+    struct xen_hvm_vmtrace_op a;
+    struct domain *d = NULL;
+    int rc = -EFAULT;
+    int i;
+    struct vcpu *v;
+    void* buf;
+    uint32_t buf_size;
+    uint32_t buf_order;
+    uint64_t buf_mfn;
+    struct page_info *pg;
+
+    if ( !hvm_ipt_supported() )
+        return -EOPNOTSUPP;
+
+    if ( copy_from_guest(&a, arg, 1) )
+        return -EFAULT;
+
+    if ( a.version != HVMOP_VMTRACE_INTERFACE_VERSION )
+        return -EINVAL;
+
+    switch ( a.cmd )
+    {
+    case HVMOP_vmtrace_ipt_enable:
+    case HVMOP_vmtrace_ipt_disable:
+    case HVMOP_vmtrace_ipt_get_buf:
+    case HVMOP_vmtrace_ipt_get_offset:
+        break;
+
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    d = rcu_lock_domain_by_any_id(a.domain);
+
+    if ( d == NULL )
+        return -ESRCH;
+
+    if ( !is_hvm_domain(d) )
+    {
+        rc = -EOPNOTSUPP;
+        goto out;
+    }
+
+    domain_pause(d);
+
+    if ( a.vcpu >= d->max_vcpus )
+    {
+        rc = -EINVAL;
+        goto out;
+    }
+
+    v = d->vcpu[a.vcpu];
+
+    if ( a.cmd == HVMOP_vmtrace_ipt_enable )
+    {
+        if ( v->arch.hvm.vmx.ipt_state ) {
+            // already enabled
+            rc = -EINVAL;
+            goto out;
+        }
+
+        if ( a.size < PAGE_SIZE || a.size > 1000000 * PAGE_SIZE ) {
+            // we don't accept trace buffer size smaller than single page
+            // and the upper bound is defined as 4GB in the specification
+            rc = -EINVAL;
+            goto out;
+	}
+
+        buf_order = get_order_from_bytes(a.size);
+
+        if ( (a.size >> PAGE_SHIFT) != (1 << buf_order) ) {
+            rc = -EINVAL;
+            goto out;
+        }
+
+        buf = page_to_virt(alloc_domheap_pages(d, buf_order, MEMF_no_refcount));
+        buf_size = a.size;
+
+        if ( !buf ) {
+            rc = -EFAULT;
+            goto out;
+        }
+
+        memset(buf, 0, buf_size);
+
+        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ ) {
+            share_xen_page_with_privileged_guests(virt_to_page(buf) + i, SHARE_ro);
+        }
+
+        v->arch.hvm.vmx.ipt_state = xmalloc(struct ipt_state);
+        v->arch.hvm.vmx.ipt_state->output_base = virt_to_mfn(buf) << PAGE_SHIFT;
+        v->arch.hvm.vmx.ipt_state->output_mask = buf_size - 1;
+        v->arch.hvm.vmx.ipt_state->status = 0;
+        v->arch.hvm.vmx.ipt_state->ctl = RTIT_CTL_TRACEEN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_BRANCH_EN;
+    }
+    else if ( a.cmd == HVMOP_vmtrace_ipt_disable )
+    {
+        if ( !v->arch.hvm.vmx.ipt_state ) {
+            rc = -EINVAL;
+            goto out;
+        }
+
+        buf_mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
+        buf_size = ( v->arch.hvm.vmx.ipt_state->output_mask + 1 ) & 0xFFFFFFFFUL;
+
+        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
+        {
+            if ( (mfn_to_page(_mfn(buf_mfn + i))->count_info & PGC_count_mask) != 1 )
+            {
+                rc = -EBUSY;
+                goto out;
+            }
+        }
+
+        xfree(v->arch.hvm.vmx.ipt_state);
+	v->arch.hvm.vmx.ipt_state = NULL;
+
+        for ( i = 0; i < (buf_size >> PAGE_SHIFT); i++ )
+        {
+            pg = mfn_to_page(_mfn(buf_mfn + i));
+            put_page_alloc_ref(pg);
+            if ( !test_and_clear_bit(_PGC_xen_heap, &pg->count_info) )
+                ASSERT_UNREACHABLE();
+            pg->u.inuse.type_info = 0;
+            page_set_owner(pg, NULL);
+            free_domheap_page(pg);
+        }
+    }
+    else if ( a.cmd == HVMOP_vmtrace_ipt_get_buf )
+    {
+        if ( !v->arch.hvm.vmx.ipt_state ) {
+            rc = -EINVAL;
+            goto out;
+        }
+
+        a.mfn = v->arch.hvm.vmx.ipt_state->output_base >> PAGE_SHIFT;
+        a.size = (v->arch.hvm.vmx.ipt_state->output_mask + 1) & 0xFFFFFFFFUL;
+    }
+    else if ( a.cmd == HVMOP_vmtrace_ipt_get_offset )
+    {
+        if ( !v->arch.hvm.vmx.ipt_state ) {
+            rc = -EINVAL;
+            goto out;
+        }
+
+        a.offset = v->arch.hvm.vmx.ipt_state->output_mask >> 32;
+    }
+
+    rc = -EFAULT;
+    if ( __copy_to_guest(arg, &a, 1) )
+      goto out;
+    rc = 0;
+
+ out:
+    smp_wmb();
+    domain_unpause(d);
+    rcu_unlock_domain(d);
+
+    return rc;
+}
+
+DEFINE_XEN_GUEST_HANDLE(compat_hvm_vmtrace_op_t);
+
 static int hvmop_get_mem_type(
     XEN_GUEST_HANDLE_PARAM(xen_hvm_get_mem_type_t) arg)
 {
@@ -5101,6 +5267,10 @@  long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
         rc = current->hcall_compat ? compat_altp2m_op(arg) : do_altp2m_op(arg);
         break;
 
+    case HVMOP_vmtrace:
+        rc = do_vmtrace_op(arg);
+        break;
+
     default:
     {
         gdprintk(XENLOG_DEBUG, "Bad HVM op %ld.\n", op);
diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index 870ec52060..3bbcd54c96 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -382,6 +382,33 @@  struct xen_hvm_altp2m_op {
 typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
 
+/* HVMOP_vmtrace: Perform VM tracing related operation */
+#define HVMOP_vmtrace 26
+
+#define HVMOP_VMTRACE_INTERFACE_VERSION 0x00000001
+
+struct xen_hvm_vmtrace_op {
+    /* IN variable */
+    uint32_t version;   /* HVMOP_VMTRACE_INTERFACE_VERSION */
+    uint32_t cmd;
+/* Enable/disable external vmtrace for given domain */
+#define HVMOP_vmtrace_ipt_enable      1
+#define HVMOP_vmtrace_ipt_disable     2
+#define HVMOP_vmtrace_ipt_get_buf     3
+#define HVMOP_vmtrace_ipt_get_offset  4
+    domid_t domain;
+    uint32_t vcpu;
+
+    /* IN/OUT variable */
+    uint64_t size;
+
+    /* OUT variable */
+    uint64_t mfn;
+    uint64_t offset;
+};
+typedef struct xen_hvm_vmtrace_op xen_hvm_vmtrace_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_vmtrace_op_t);
+
 #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */
 
 /*