
[v9,4/5] x86/ioreq server: Asynchronously reset outstanding p2m_ioreq_server entries.

Message ID 1490064773-26751-5-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang March 21, 2017, 2:52 a.m. UTC
After an ioreq server has been unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously, using the current p2m_change_entry_type_global()
interface.

This patch also disallows live migration while any outstanding
p2m_ioreq_server entries remain. The core reason is that our
current implementation of p2m_change_entry_type_global() cannot
tell the state of p2m_ioreq_server entries (it cannot decide whether an
entry is to be emulated or to be resynced).

Note: a new field, entry_count, is introduced in struct p2m_domain
to record the number of p2m_ioreq_server p2m page table entries.
One property of these entries is that they only point to 4K sized
page frames, because all p2m_ioreq_server entries originate from
p2m_ram_rw ones in p2m_change_type_one(). We therefore do not need
to worry about counting 2M/1G sized pages.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

changes in v4: 
  - According to comments from Jan: use ASSERT() instead of 'if'
    condition in p2m_change_type_one().
  - According to comments from Jan: commit message changes to mention
    that the p2m_ioreq_server entries are all based on 4K sized pages.

changes in v3: 
  - Move the synchronous resetting logic into patch 5.
  - According to comments from Jan: introduce p2m_check_changeable()
    to clarify the p2m type change code.
  - According to comments from George: use locks in the same order
    to avoid deadlock, call p2m_change_entry_type_global() after unmap
    of the ioreq server is finished.

changes in v2: 
  - Move the calculation of the ioreq server page entry_count into
    p2m_change_type_one() so that we do not need a separate lock.
    Note: entry_count is also calculated in resolve_misconfig()/
    do_recalc(); fortunately, callers of both routines already hold
    the p2m lock.
  - Simplify logic in hvmop_set_mem_type().
  - Introduce routine p2m_finish_type_change() to walk the p2m 
    table and do the p2m reset.
---
 xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
 xen/arch/x86/mm/hap/hap.c |  9 +++++++++
 xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
 xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
 xen/arch/x86/mm/p2m.c     | 20 ++++++++++++++++++++
 xen/include/asm-x86/p2m.h |  9 ++++++++-
 6 files changed, 63 insertions(+), 4 deletions(-)

Comments

Paul Durrant March 21, 2017, 10:05 a.m. UTC | #1
> -----Original Message-----
> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: 21 March 2017 02:53
> To: xen-devel@lists.xen.org
> Cc: zhiyuan.lv@intel.com; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>; Jun Nakajima <jun.nakajima@intel.com>;
> Kevin Tian <kevin.tian@intel.com>
> Subject: [PATCH v9 4/5] x86/ioreq server: Asynchronously reset outstanding
> p2m_ioreq_server entries.
> 
> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> entries need to be reset back to p2m_ram_rw. This patch does this
> asynchronously with the current p2m_change_entry_type_global()
> interface.
> 
> This patch also disallows live migration, when there's still any
> outstanding p2m_ioreq_server entry left. The core reason is our
> current implementation of p2m_change_entry_type_global() can not
> tell the state of p2m_ioreq_server entries(can not decide if an
> entry is to be emulated or to be resynced).
> 
> Note: new field entry_count is introduced in struct p2m_domain,
> to record the number of p2m_ioreq_server p2m page table entries.
> One nature of these entries is that they only point to 4K sized
> page frames, because all p2m_ioreq_server entries are originated
> from p2m_ram_rw ones in p2m_change_type_one(). We do not need to
> worry about the counting for 2M/1G sized pages.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>

Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> 
> changes in v4:
>   - According to comments from Jan: use ASSERT() instead of 'if'
>     condition in p2m_change_type_one().
>   - According to comments from Jan: commit message changes to mention
>     the p2m_ioreq_server are all based on 4K sized pages.
> 
> changes in v3:
>   - Move the synchronously resetting logic into patch 5.
>   - According to comments from Jan: introduce p2m_check_changeable()
>     to clarify the p2m type change code.
>   - According to comments from George: use locks in the same order
>     to avoid deadlock, call p2m_change_entry_type_global() after unmap
>     of the ioreq server is finished.
> 
> changes in v2:
>   - Move the calculation of ioreq server page entry_cout into
>     p2m_change_type_one() so that we do not need a seperate lock.
>     Note: entry_count is also calculated in resolve_misconfig()/
>     do_recalc(), fortunately callers of both routines have p2m
>     lock protected already.
>   - Simplify logic in hvmop_set_mem_type().
>   - Introduce routine p2m_finish_type_change() to walk the p2m
>     table and do the p2m reset.
> ---
>  xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
>  xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>  xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
>  xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
>  xen/arch/x86/mm/p2m.c     | 20 ++++++++++++++++++++
>  xen/include/asm-x86/p2m.h |  9 ++++++++-
>  6 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index 746799f..102c6c2 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct
> domain *d, ioservid_t id,
> 
>      spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
> 
> +    if ( rc == 0 && flags == 0 )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +        if ( read_atomic(&p2m->ioreq.entry_count) )
> +            p2m_change_entry_type_global(d, p2m_ioreq_server,
> p2m_ram_rw);
> +    }
> +
>      return rc;
>  }
> 
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index a57b385..6ec950a 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -187,6 +187,15 @@ out:
>   */
>  static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
>  {
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    /*
> +     * Refuse to turn on global log-dirty mode if
> +     * there's outstanding p2m_ioreq_server pages.
> +     */
> +    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
> +        return -EBUSY;
> +
>      /* turn on PG_log_dirty bit in paging mode */
>      paging_lock(d);
>      d->arch.paging.mode |= PG_log_dirty;
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index cc1eb21..1df3d09 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
> *p2m, unsigned long gfn)
>                      e.ipat = ipat;
>                      if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>                      {
> +                         if ( e.sa_p2mt == p2m_ioreq_server )
> +                         {
> +                             p2m->ioreq.entry_count--;
> +                             ASSERT(p2m->ioreq.entry_count >= 0);
> +                         }
> +
>                           e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>                                       ? p2m_ram_logdirty : p2m_ram_rw;
>                           ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
> @@ -965,7 +971,7 @@ static mfn_t ept_get_entry(struct p2m_domain
> *p2m,
>      if ( is_epte_valid(ept_entry) )
>      {
>          if ( (recalc || ept_entry->recalc) &&
> -             p2m_is_changeable(ept_entry->sa_p2mt) )
> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>              *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
>                                                        : p2m_ram_rw;
>          else
> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
> index f6c45ec..169de75 100644
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -436,11 +436,13 @@ static int do_recalc(struct p2m_domain *p2m,
> unsigned long gfn)
>           needs_recalc(l1, *pent) )
>      {
>          l1_pgentry_t e = *pent;
> +        p2m_type_t p2mt_old;
> 
>          if ( !valid_recalc(l1, e) )
>              P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
>                        p2m->domain->domain_id, gfn, level);
> -        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
> +        if ( p2m_is_changeable(p2mt_old) )
>          {
>              unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
>              p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn |
> ~mask)
> @@ -460,6 +462,13 @@ static int do_recalc(struct p2m_domain *p2m,
> unsigned long gfn)
>                       mfn &= ~(_PAGE_PSE_PAT >> PAGE_SHIFT);
>                  flags |= _PAGE_PSE;
>              }
> +
> +            if ( p2mt_old == p2m_ioreq_server )
> +            {
> +                p2m->ioreq.entry_count--;
> +                ASSERT(p2m->ioreq.entry_count >= 0);
> +            }
> +
>              e = l1e_from_pfn(mfn, flags);
>              p2m_add_iommu_flags(&e, level,
>                                  (p2mt == p2m_ram_rw)
> @@ -729,7 +738,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
> unsigned long gfn, mfn_t mfn,
>  static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
>                                       struct p2m_domain *p2m, unsigned long gfn)
>  {
> -    if ( !recalc || !p2m_is_changeable(t) )
> +    if ( !recalc || !p2m_check_changeable(t) )
>          return t;
>      return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
>                                                  : p2m_ram_rw;
> diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
> index dd4e477..e3e54f1 100644
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -954,6 +954,26 @@ int p2m_change_type_one(struct domain *d,
> unsigned long gfn,
>                           p2m->default_access)
>           : -EBUSY;
> 
> +    if ( !rc )
> +    {
> +        switch ( nt )
> +        {
> +        case p2m_ram_rw:
> +            if ( ot == p2m_ioreq_server )
> +            {
> +                p2m->ioreq.entry_count--;
> +                ASSERT(p2m->ioreq.entry_count >= 0);
> +            }
> +            break;
> +        case p2m_ioreq_server:
> +            ASSERT(ot == p2m_ram_rw);
> +            p2m->ioreq.entry_count++;
> +            break;
> +        default:
> +            break;
> +        }
> +    }
> +
>      gfn_unlock(p2m, gfn, 0);
> 
>      return rc;
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index 3786680..395f125 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -120,7 +120,10 @@ typedef unsigned int p2m_query_t;
> 
>  /* Types that can be subject to bulk transitions. */
>  #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
> -                              | p2m_to_mask(p2m_ram_logdirty) )
> +                              | p2m_to_mask(p2m_ram_logdirty) \
> +                              | p2m_to_mask(p2m_ioreq_server) )
> +
> +#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
> 
>  #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
> 
> @@ -157,6 +160,7 @@ typedef unsigned int p2m_query_t;
>  #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
>  #define p2m_is_discard_write(_t) (p2m_to_mask(_t) &
> P2M_DISCARD_WRITE_TYPES)
>  #define p2m_is_changeable(_t) (p2m_to_mask(_t) &
> P2M_CHANGEABLE_TYPES)
> +#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
>  #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
>  #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
>  /* Grant types are *not* considered valid, because they can be
> @@ -178,6 +182,8 @@ typedef unsigned int p2m_query_t;
> 
>  #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) &
> P2M_INVALID_MFN_TYPES)
> 
> +#define p2m_check_changeable(t) (p2m_is_changeable(t) &&
> !p2m_is_ioreq(t))
> +
>  typedef enum {
>      p2m_host,
>      p2m_nested,
> @@ -349,6 +355,7 @@ struct p2m_domain {
>            * are to be emulated by an ioreq server.
>            */
>           unsigned int flags;
> +         long entry_count;
>       } ioreq;
>  };
> 
> --
> 1.9.1
Tian, Kevin March 22, 2017, 8:10 a.m. UTC | #2
> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: Tuesday, March 21, 2017 10:53 AM
> 
> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> entries need to be reset back to p2m_ram_rw. This patch does this
> asynchronously with the current p2m_change_entry_type_global() interface.
> 
> This patch also disallows live migration, when there's still any outstanding
> p2m_ioreq_server entry left. The core reason is our current implementation
> of p2m_change_entry_type_global() can not tell the state of
> p2m_ioreq_server entries(can not decide if an entry is to be emulated or to
> be resynced).

I don't quite get this point. change_global is triggered only upon
unmap. At that point there is no ioreq server to emulate the
write operations on those entries. All that is required is
just to change the type. What's the exact decision required here?

btw, does it mean that live migration can still be supported as long as
the device model proactively unmaps write-protected pages before
starting live migration?


> 
> Note: new field entry_count is introduced in struct p2m_domain, to record
> the number of p2m_ioreq_server p2m page table entries.
> One nature of these entries is that they only point to 4K sized page frames,
> because all p2m_ioreq_server entries are originated from p2m_ram_rw ones
> in p2m_change_type_one(). We do not need to worry about the counting for
> 2M/1G sized pages.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> 
> changes in v4:
>   - According to comments from Jan: use ASSERT() instead of 'if'
>     condition in p2m_change_type_one().
>   - According to comments from Jan: commit message changes to mention
>     the p2m_ioreq_server are all based on 4K sized pages.
> 
> changes in v3:
>   - Move the synchronously resetting logic into patch 5.
>   - According to comments from Jan: introduce p2m_check_changeable()
>     to clarify the p2m type change code.
>   - According to comments from George: use locks in the same order
>     to avoid deadlock, call p2m_change_entry_type_global() after unmap
>     of the ioreq server is finished.
> 
> changes in v2:
>   - Move the calculation of ioreq server page entry_cout into
>     p2m_change_type_one() so that we do not need a seperate lock.
>     Note: entry_count is also calculated in resolve_misconfig()/
>     do_recalc(), fortunately callers of both routines have p2m
>     lock protected already.
>   - Simplify logic in hvmop_set_mem_type().
>   - Introduce routine p2m_finish_type_change() to walk the p2m
>     table and do the p2m reset.
> ---
>  xen/arch/x86/hvm/ioreq.c  |  8 ++++++++  xen/arch/x86/mm/hap/hap.c |  9
> +++++++++  xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
> xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
>  xen/arch/x86/mm/p2m.c     | 20 ++++++++++++++++++++
>  xen/include/asm-x86/p2m.h |  9 ++++++++-
>  6 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c index
> 746799f..102c6c2 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct
> domain *d, ioservid_t id,
> 
>      spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
> 
> +    if ( rc == 0 && flags == 0 )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +        if ( read_atomic(&p2m->ioreq.entry_count) )
> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
> +    }
> +
>      return rc;
>  }
> 
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index a57b385..6ec950a 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -187,6 +187,15 @@ out:
>   */
>  static int hap_enable_log_dirty(struct domain *d, bool_t log_global)  {
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    /*
> +     * Refuse to turn on global log-dirty mode if
> +     * there's outstanding p2m_ioreq_server pages.
> +     */
> +    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
> +        return -EBUSY;

I know this has been discussed before, but I don't remember
the detailed reason - why can't we allow log-dirty mode when
there are still outstanding p2m_ioreq_server pages? Can't
we mark the related page as dirty when forwarding the write emulation
request to the corresponding ioreq server?

> +
>      /* turn on PG_log_dirty bit in paging mode */
>      paging_lock(d);
>      d->arch.paging.mode |= PG_log_dirty; diff --git a/xen/arch/x86/mm/p2m-
> ept.c b/xen/arch/x86/mm/p2m-ept.c index cc1eb21..1df3d09 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
> *p2m, unsigned long gfn)
>                      e.ipat = ipat;
>                      if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>                      {
> +                         if ( e.sa_p2mt == p2m_ioreq_server )
> +                         {
> +                             p2m->ioreq.entry_count--;
> +                             ASSERT(p2m->ioreq.entry_count >= 0);
> +                         }
> +
>                           e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>                                       ? p2m_ram_logdirty : p2m_ram_rw;
>                           ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access); @@ -
> 965,7 +971,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>      if ( is_epte_valid(ept_entry) )
>      {
>          if ( (recalc || ept_entry->recalc) &&
> -             p2m_is_changeable(ept_entry->sa_p2mt) )
> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>              *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
>                                                        : p2m_ram_rw;
>          else
> diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c index
> f6c45ec..169de75 100644
> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -436,11 +436,13 @@ static int do_recalc(struct p2m_domain *p2m,
> unsigned long gfn)
>           needs_recalc(l1, *pent) )
>      {
>          l1_pgentry_t e = *pent;
> +        p2m_type_t p2mt_old;
> 
>          if ( !valid_recalc(l1, e) )
>              P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
>                        p2m->domain->domain_id, gfn, level);
> -        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
> +        if ( p2m_is_changeable(p2mt_old) )
>          {
>              unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
>              p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn |
> ~mask) @@ -460,6 +462,13 @@ static int do_recalc(struct p2m_domain
> *p2m, unsigned long gfn)
>                       mfn &= ~(_PAGE_PSE_PAT >> PAGE_SHIFT);
>                  flags |= _PAGE_PSE;
>              }
> +
> +            if ( p2mt_old == p2m_ioreq_server )
> +            {
> +                p2m->ioreq.entry_count--;
> +                ASSERT(p2m->ioreq.entry_count >= 0);
> +            }
> +
>              e = l1e_from_pfn(mfn, flags);
>              p2m_add_iommu_flags(&e, level,
>                                  (p2mt == p2m_ram_rw) @@ -729,7 +738,7 @@
> p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
> static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
>                                       struct p2m_domain *p2m, unsigned long gfn)  {
> -    if ( !recalc || !p2m_is_changeable(t) )
> +    if ( !recalc || !p2m_check_changeable(t) )
>          return t;
>      return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
>                                                  : p2m_ram_rw; diff --git
> a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c index
> dd4e477..e3e54f1 100644
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -954,6 +954,26 @@ int p2m_change_type_one(struct domain *d,
> unsigned long gfn,
>                           p2m->default_access)
>           : -EBUSY;
> 
> +    if ( !rc )
> +    {
> +        switch ( nt )
> +        {
> +        case p2m_ram_rw:
> +            if ( ot == p2m_ioreq_server )
> +            {
> +                p2m->ioreq.entry_count--;
> +                ASSERT(p2m->ioreq.entry_count >= 0);
> +            }
> +            break;
> +        case p2m_ioreq_server:
> +            ASSERT(ot == p2m_ram_rw);
> +            p2m->ioreq.entry_count++;
> +            break;
> +        default:
> +            break;
> +        }
> +    }
> +
>      gfn_unlock(p2m, gfn, 0);
> 
>      return rc;
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h index
> 3786680..395f125 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -120,7 +120,10 @@ typedef unsigned int p2m_query_t;
> 
>  /* Types that can be subject to bulk transitions. */  #define
> P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
> -                              | p2m_to_mask(p2m_ram_logdirty) )
> +                              | p2m_to_mask(p2m_ram_logdirty) \
> +                              | p2m_to_mask(p2m_ioreq_server) )
> +
> +#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
> 
>  #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
> 
> @@ -157,6 +160,7 @@ typedef unsigned int p2m_query_t;  #define
> p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)  #define
> p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
> #define p2m_is_changeable(_t) (p2m_to_mask(_t) &
> P2M_CHANGEABLE_TYPES)
> +#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
>  #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)  #define
> p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
>  /* Grant types are *not* considered valid, because they can be @@ -178,6
> +182,8 @@ typedef unsigned int p2m_query_t;
> 
>  #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) &
> P2M_INVALID_MFN_TYPES)
> 
> +#define p2m_check_changeable(t) (p2m_is_changeable(t) &&
> +!p2m_is_ioreq(t))
> +
>  typedef enum {
>      p2m_host,
>      p2m_nested,
> @@ -349,6 +355,7 @@ struct p2m_domain {
>            * are to be emulated by an ioreq server.
>            */
>           unsigned int flags;
> +         long entry_count;
>       } ioreq;
>  };
> 
> --
> 1.9.1
Yu Zhang March 22, 2017, 10:12 a.m. UTC | #3
On 3/22/2017 4:10 PM, Tian, Kevin wrote:
>> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: Tuesday, March 21, 2017 10:53 AM
>>
>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>> entries need to be reset back to p2m_ram_rw. This patch does this
>> asynchronously with the current p2m_change_entry_type_global() interface.
>>
>> This patch also disallows live migration, when there's still any outstanding
>> p2m_ioreq_server entry left. The core reason is our current implementation
>> of p2m_change_entry_type_global() can not tell the state of
>> p2m_ioreq_server entries(can not decide if an entry is to be emulated or to
>> be resynced).
> Don't quite get this point. change_global is triggered only upon
> unmap. At that point there is no ioreq server to emulate the
> write operations on those entries. All the things required is
> just to change the type. What's the exact decision required here?

Well, one situation I can recall is when another ioreq server maps to
this type, and live migration happens later. The resolve_misconfig()
code cannot differentiate whether a p2m_ioreq_server page is an obsolete
one to be resynced, or a new one to be emulated only.

I gave some explanation on this issue in discussion during Jun 20 - 22 last
year.

http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02426.html
on Jun 20
and
http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02575.html
on Jun 21

> btw does it mean that live migration can be still supported as long as
> device model proactively unmaps write-protected pages before
> starting live migration?
>

Yes.

>> Note: new field entry_count is introduced in struct p2m_domain, to record
>> the number of p2m_ioreq_server p2m page table entries.
>> One nature of these entries is that they only point to 4K sized page frames,
>> because all p2m_ioreq_server entries are originated from p2m_ram_rw ones
>> in p2m_change_type_one(). We do not need to worry about the counting for
>> 2M/1G sized pages.
>>
>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> ---
>> Cc: Paul Durrant <paul.durrant@citrix.com>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>>
>> changes in v4:
>>    - According to comments from Jan: use ASSERT() instead of 'if'
>>      condition in p2m_change_type_one().
>>    - According to comments from Jan: commit message changes to mention
>>      the p2m_ioreq_server are all based on 4K sized pages.
>>
>> changes in v3:
>>    - Move the synchronously resetting logic into patch 5.
>>    - According to comments from Jan: introduce p2m_check_changeable()
>>      to clarify the p2m type change code.
>>    - According to comments from George: use locks in the same order
>>      to avoid deadlock, call p2m_change_entry_type_global() after unmap
>>      of the ioreq server is finished.
>>
>> changes in v2:
>>    - Move the calculation of ioreq server page entry_cout into
>>      p2m_change_type_one() so that we do not need a seperate lock.
>>      Note: entry_count is also calculated in resolve_misconfig()/
>>      do_recalc(), fortunately callers of both routines have p2m
>>      lock protected already.
>>    - Simplify logic in hvmop_set_mem_type().
>>    - Introduce routine p2m_finish_type_change() to walk the p2m
>>      table and do the p2m reset.
>> ---
>>   xen/arch/x86/hvm/ioreq.c  |  8 ++++++++  xen/arch/x86/mm/hap/hap.c |  9
>> +++++++++  xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
>> xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
>>   xen/arch/x86/mm/p2m.c     | 20 ++++++++++++++++++++
>>   xen/include/asm-x86/p2m.h |  9 ++++++++-
>>   6 files changed, 63 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c index
>> 746799f..102c6c2 100644
>> --- a/xen/arch/x86/hvm/ioreq.c
>> +++ b/xen/arch/x86/hvm/ioreq.c
>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct
>> domain *d, ioservid_t id,
>>
>>       spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>
>> +    if ( rc == 0 && flags == 0 )
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>> +    }
>> +
>>       return rc;
>>   }
>>
>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>> index a57b385..6ec950a 100644
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -187,6 +187,15 @@ out:
>>    */
>>   static int hap_enable_log_dirty(struct domain *d, bool_t log_global)  {
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    /*
>> +     * Refuse to turn on global log-dirty mode if
>> +     * there's outstanding p2m_ioreq_server pages.
>> +     */
>> +    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
>> +        return -EBUSY;
> I know this has been discussed before, but didn't remember
> the detail reason - why cannot allow log-dirty mode when
> there are still outstanding p2m_ioreq_server pages? Cannot
> we mark related page as dirty when forwarding write emulation
> request to corresponding ioreq server?

IIUC, changing a page to p2m_ram_logdirty will only mark it as read-only
once: after it is marked as dirty, it will be changed back to p2m_ram_rw.
Also, the handling of a log-dirty page is quite different; we'd like the
p2m_ioreq_server ones to be handled like p2m_mmio_dm.
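
To illustrate the difference, a write-fault handler for the two types
might conceptually look like the sketch below. This is illustrative only;
the two helpers are hypothetical, not actual Xen functions, while
p2m_change_type_one() is the real routine used elsewhere in this series:

    /* Conceptual sketch only -- mark_gfn_dirty() and
     * forward_write_to_ioreq_server() are hypothetical helpers. */
    void mark_gfn_dirty(struct domain *d, unsigned long gfn);
    void forward_write_to_ioreq_server(struct domain *d, unsigned long gfn);

    static void handle_write_fault(struct domain *d, unsigned long gfn,
                                   p2m_type_t p2mt)
    {
        if ( p2mt == p2m_ram_logdirty )
        {
            /* Log-dirty: record the dirty gfn once, then flip the entry
             * back to p2m_ram_rw so later writes hit RAM without faulting. */
            mark_gfn_dirty(d, gfn);
            p2m_change_type_one(d, gfn, p2m_ram_logdirty, p2m_ram_rw);
        }
        else if ( p2mt == p2m_ioreq_server )
        {
            /* p2m_ioreq_server pages stay write-protected: every write is
             * forwarded to the emulator, much like p2m_mmio_dm. */
            forward_write_to_ioreq_server(d, gfn);
        }
    }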

Thanks
Yu

[snip]
Jan Beulich March 22, 2017, 2:29 p.m. UTC | #4
>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
>  
>      spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>  
> +    if ( rc == 0 && flags == 0 )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +        if ( read_atomic(&p2m->ioreq.entry_count) )
> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
> +    }

If you do this after dropping the lock, don't you risk a race with
another server mapping the type to itself?

> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, 
> unsigned long gfn)
>                      e.ipat = ipat;
>                      if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>                      {
> +                         if ( e.sa_p2mt == p2m_ioreq_server )
> +                         {
> +                             p2m->ioreq.entry_count--;
> +                             ASSERT(p2m->ioreq.entry_count >= 0);

If you did the ASSERT() first (using > 0), you wouldn't need the
type to be a signed one, doubling the valid value range (even if
right now the full 64 bits can't be used anyway, it would be
one less thing to worry about once we get 6-level page tables).
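
Concretely, something like this (just a sketch, assuming the field is
made unsigned, e.g. "unsigned long entry_count;" in the ioreq
sub-structure):

    /* Sketch only: assert before decrementing, so the counter can be
     * declared unsigned instead of a signed long. */
    ASSERT(p2m->ioreq.entry_count > 0);
    p2m->ioreq.entry_count--;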

Jan
Yu Zhang March 23, 2017, 3:23 a.m. UTC | #5
On 3/22/2017 10:29 PM, Jan Beulich wrote:
>>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/hvm/ioreq.c
>> +++ b/xen/arch/x86/hvm/ioreq.c
>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
>>   
>>       spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>   
>> +    if ( rc == 0 && flags == 0 )
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>> +    }
> If you do this after dropping the lock, don't you risk a race with
> another server mapping the type to itself?

I believe it's OK. Remaining p2m_ioreq_server entries still need to be
cleaned up anyway.

>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m,
>> unsigned long gfn)
>>                       e.ipat = ipat;
>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>                       {
>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>> +                         {
>> +                             p2m->ioreq.entry_count--;
>> +                             ASSERT(p2m->ioreq.entry_count >= 0);
> If you did the ASSERT() first (using > 0), you wouldn't need the
> type be a signed one, doubling the valid value range (even if
> right now the full 64 bits can't be used anyway, but it would be
> one less thing to worry about once we get 6-level page tables).

Well, entry_count only counts 4K pages, so even if the guest physical
address width is extended up to 64 bits in the future, entry_count will
not exceed 2^52 (2^64/2^12).

Yu
> Jan
>
>
Jan Beulich March 23, 2017, 9 a.m. UTC | #6
>>> On 23.03.17 at 04:23, <yu.c.zhang@linux.intel.com> wrote:
> On 3/22/2017 10:29 PM, Jan Beulich wrote:
>>>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/ioreq.c
>>> +++ b/xen/arch/x86/hvm/ioreq.c
>>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
>>>   
>>>       spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>>   
>>> +    if ( rc == 0 && flags == 0 )
>>> +    {
>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +
>>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>> +    }
>> If you do this after dropping the lock, don't you risk a race with
>> another server mapping the type to itself?
> 
> I believe it's OK. Remaining p2m_ioreq_server entries still needs to be 
> cleaned anyway.

Are you refusing a new server mapping the type before being
done with the cleanup?

>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m,
>>> unsigned long gfn)
>>>                       e.ipat = ipat;
>>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>                       {
>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>> +                         {
>>> +                             p2m->ioreq.entry_count--;
>>> +                             ASSERT(p2m->ioreq.entry_count >= 0);
>> If you did the ASSERT() first (using > 0), you wouldn't need the
>> type be a signed one, doubling the valid value range (even if
>> right now the full 64 bits can't be used anyway, but it would be
>> one less thing to worry about once we get 6-level page tables).
> 
> Well, entry_count counts only for 4K pages, so even if the guest 
> physical address
> width is extended up to 64 bit in the future, entry_count will not 
> exceed 2^52(
> 2^64/2^12).

Oh, true. Still I'd prefer if you used an unsigned type for a count
when that's easily possible.

Jan
Yu Zhang March 24, 2017, 9:05 a.m. UTC | #7
On 3/23/2017 5:00 PM, Jan Beulich wrote:
>>>> On 23.03.17 at 04:23, <yu.c.zhang@linux.intel.com> wrote:
>> On 3/22/2017 10:29 PM, Jan Beulich wrote:
>>>>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
>>>> --- a/xen/arch/x86/hvm/ioreq.c
>>>> +++ b/xen/arch/x86/hvm/ioreq.c
>>>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
>>>>    
>>>>        spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>>>    
>>>> +    if ( rc == 0 && flags == 0 )
>>>> +    {
>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +
>>>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>>>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>>> +    }
>>> If you do this after dropping the lock, don't you risk a race with
>>> another server mapping the type to itself?
>> I believe it's OK. Remaining p2m_ioreq_server entries still needs to be
>> cleaned anyway.
> Are you refusing a new server mapping the type before being
> done with the cleanup?

No. I meant that even if a new server is mapped, we can still sweep the
p2m table later asynchronously.
But this reminds me of another point - will a dm op be interrupted by
another one, or should it be?
Since we have patch 5/5, which sweeps the p2m table right after the unmap
happens, maybe we should refuse any mapping request if there are
remaining p2m_ioreq_server entries.
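
Something along these lines, perhaps (just a sketch, not the code in
this series; the placement inside hvm_map_mem_type_to_ioreq_server()
and the use of -EBUSY are only illustrative):

    /* Sketch only: refuse a new p2m_ioreq_server mapping while entries
     * left over from a previously unmapped server are still being swept. */
    struct p2m_domain *p2m = p2m_get_hostp2m(d);

    /* flags != 0 means a server is asking to map the type to itself. */
    if ( flags != 0 && read_atomic(&p2m->ioreq.entry_count) )
        rc = -EBUSY;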

>
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m,
>>>> unsigned long gfn)
>>>>                        e.ipat = ipat;
>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>                        {
>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>> +                         {
>>>> +                             p2m->ioreq.entry_count--;
>>>> +                             ASSERT(p2m->ioreq.entry_count >= 0);
>>> If you did the ASSERT() first (using > 0), you wouldn't need the
>>> type be a signed one, doubling the valid value range (even if
>>> right now the full 64 bits can't be used anyway, but it would be
>>> one less thing to worry about once we get 6-level page tables).
>> Well, entry_count counts only for 4K pages, so even if the guest
>> physical address
>> width is extended up to 64 bit in the future, entry_count will not
>> exceed 2^52(
>> 2^64/2^12).
> Oh, true. Still I'd prefer if you used an unsigned type for a count
> when that's easily possible.

Got it. :)

Yu
> Jan
>
>
Tian, Kevin March 24, 2017, 9:37 a.m. UTC | #8
> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: Wednesday, March 22, 2017 6:12 PM
>
> On 3/22/2017 4:10 PM, Tian, Kevin wrote:
> >> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
> >> Sent: Tuesday, March 21, 2017 10:53 AM
> >>
> >> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> >> entries need to be reset back to p2m_ram_rw. This patch does this
> >> asynchronously with the current p2m_change_entry_type_global() interface.
> >>
> >> This patch also disallows live migration, when there's still any
> >> outstanding p2m_ioreq_server entry left. The core reason is our
> >> current implementation of p2m_change_entry_type_global() can not tell
> >> the state of p2m_ioreq_server entries(can not decide if an entry is
> >> to be emulated or to be resynced).
> > Don't quite get this point. change_global is triggered only upon
> > unmap. At that point there is no ioreq server to emulate the write
> > operations on those entries. All the things required is just to change
> > the type. What's the exact decision required here?
>
> Well, one situation I can recall is that if another ioreq server maps to this
> type, and live migration happens later. The resolve_misconfig() code cannot
> differentiate if an p2m_ioreq_server page is an obsolete to be synced, or a
> new one to be only emulated.

So if you disallow another mapping before the obsolete pages are synced,
as you just replied in another mail, then such a limitation would be gone?

>
> I gave some explanation on this issue in discussion during Jun 20 - 22 last
> year.
>
> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02426.html
> on Jun 20
> and
> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02575.html
> on Jun 21
>
> > btw does it mean that live migration can be still supported as long as
> > device model proactively unmaps write-protected pages before starting
> > live migration?
> >
>
> Yes.

I'm not sure whether there'll be a sequence issue. I assume the toolstack
will first request entering logdirty mode, do the iterative memory copy,
then stop the VM including its virtual devices, and then build the final
image (including vGPU state). XenGT supports live migration today.
The vGPU device model is notified to do its state save/restore only in the
last step of that flow (as part of the Qemu save/restore). If your design
requires the vGPU device model to first unmap write-protected pages
(which means it can no longer serve requests from the guest, equivalent
to stopping the vGPU) before the toolstack enters logdirty mode, I'm
worried about the required changes to the whole live migration flow...

Thanks
Kevin
Jan Beulich March 24, 2017, 10:37 a.m. UTC | #9
>>> On 24.03.17 at 10:05, <yu.c.zhang@linux.intel.com> wrote:

> 
> On 3/23/2017 5:00 PM, Jan Beulich wrote:
>>>>> On 23.03.17 at 04:23, <yu.c.zhang@linux.intel.com> wrote:
>>> On 3/22/2017 10:29 PM, Jan Beulich wrote:
>>>>>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
>>>>> --- a/xen/arch/x86/hvm/ioreq.c
>>>>> +++ b/xen/arch/x86/hvm/ioreq.c
>>>>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, 
> ioservid_t id,
>>>>>    
>>>>>        spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>>>>    
>>>>> +    if ( rc == 0 && flags == 0 )
>>>>> +    {
>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>> +
>>>>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>>>>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>>>> +    }
>>>> If you do this after dropping the lock, don't you risk a race with
>>>> another server mapping the type to itself?
>>> I believe it's OK. Remaining p2m_ioreq_server entries still needs to be
>>> cleaned anyway.
>> Are you refusing a new server mapping the type before being
>> done with the cleanup?
> 
> No. I meant even a new server is mapped, we can still sweep the p2m 
> table later asynchronously.
> But this reminds me other point - will a dm op be interrupted by another 
> one, or should it?

Interrupted? Two of them may run in parallel on different CPUs,
against the same target domain.

> Since we have patch 5/5 which sweep the p2m table right after the unmap 
> happens, maybe
> we should refuse any mapping requirement if there's remaining 
> p2m_ioreq_server entries.

That's what I've tried to hint at with my question.

Jan
Yu Zhang March 24, 2017, 12:36 p.m. UTC | #10
On 3/24/2017 6:37 PM, Jan Beulich wrote:
>>>> On 24.03.17 at 10:05, <yu.c.zhang@linux.intel.com> wrote:
>> On 3/23/2017 5:00 PM, Jan Beulich wrote:
>>>>>> On 23.03.17 at 04:23, <yu.c.zhang@linux.intel.com> wrote:
>>>> On 3/22/2017 10:29 PM, Jan Beulich wrote:
>>>>>>>> On 21.03.17 at 03:52, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> --- a/xen/arch/x86/hvm/ioreq.c
>>>>>> +++ b/xen/arch/x86/hvm/ioreq.c
>>>>>> @@ -949,6 +949,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d,
>> ioservid_t id,
>>>>>>     
>>>>>>         spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>>>>>>     
>>>>>> +    if ( rc == 0 && flags == 0 )
>>>>>> +    {
>>>>>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>>> +
>>>>>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>>>>>> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>>>>> +    }
>>>>> If you do this after dropping the lock, don't you risk a race with
>>>>> another server mapping the type to itself?
>>>> I believe it's OK. Remaining p2m_ioreq_server entries still needs to be
>>>> cleaned anyway.
>>> Are you refusing a new server mapping the type before being
>>> done with the cleanup?
>> No. I meant even a new server is mapped, we can still sweep the p2m
>> table later asynchronously.
>> But this reminds me other point - will a dm op be interrupted by another
>> one, or should it?
> Interrupted? Two of them may run in parallel on different CPUs,
> against the same target domain.

Right. That's possible.

>> Since we have patch 5/5 which sweep the p2m table right after the unmap
>> happens, maybe
>> we should refuse any mapping requirement if there's remaining
>> p2m_ioreq_server entries.
> That's what I've tried to hint at with my question.

Oh. I see. Thank you, Jan. :-)

Yu
> Jan
>
>
Yu Zhang March 24, 2017, 12:45 p.m. UTC | #11
On 3/24/2017 5:37 PM, Tian, Kevin wrote:
>> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: Wednesday, March 22, 2017 6:12 PM
>>
>> On 3/22/2017 4:10 PM, Tian, Kevin wrote:
>>>> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
>>>> Sent: Tuesday, March 21, 2017 10:53 AM
>>>>
>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>> asynchronously with the current p2m_change_entry_type_global()
>> interface.
>>>> This patch also disallows live migration, when there's still any
>>>> outstanding p2m_ioreq_server entry left. The core reason is our
>>>> current implementation of p2m_change_entry_type_global() can not tell
>>>> the state of p2m_ioreq_server entries(can not decide if an entry is
>>>> to be emulated or to be resynced).
>>> Don't quite get this point. change_global is triggered only upon
>>> unmap. At that point there is no ioreq server to emulate the write
>>> operations on those entries. All the things required is just to change
>>> the type. What's the exact decision required here?
>> Well, one situation I can recall is that if another ioreq server maps to this
>> type, and live migration happens later. The resolve_misconfig() code cannot
>> differentiate if an p2m_ioreq_server page is an obsolete to be synced, or a
>> new one to be only emulated.
> so if you disallow another mapping before obsolete pages are synced
> as you just replied in another mail, then such limitation would be gone?

Well, it may still have problems.

Even if we know the remaining p2m_ioreq_server entry is definitely an
outdated one, resolve_misconfig() still lacks the information to decide
whether this p2m type should be reset to p2m_ram_rw (during a p2m sweep)
or marked as p2m_ram_logdirty (during a live migration process). The
current code in resolve_misconfig() lacks such information.

I mean, we surely could reset these pages to p2m_ram_logdirty directly if
global_logdirty is on. But I do not think that is the correct thing to do:
although these pages would be reset back to p2m_ram_rw later in the EPT
violation handling process, it might cause some clean pages (which were
write-protected once, but no longer are) to be tracked and sent to the
target if live migration is triggered.
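
For reference, the resolve_misconfig() hunk in question, with comments
added here only to mark the ambiguity (the comments are not part of the
patch):

    if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
    {
        if ( e.sa_p2mt == p2m_ioreq_server )
        {
            p2m->ioreq.entry_count--;
            ASSERT(p2m->ioreq.entry_count >= 0);
        }

        /*
         * The entry is known to be a stale p2m_ioreq_server one here, but
         * the code below can only apply the generic rule: pick
         * p2m_ram_logdirty whenever the range is log-dirty tracked, which
         * is not necessarily the right choice for an ioreq-server page.
         */
        e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
                    ? p2m_ram_logdirty : p2m_ram_rw;
        ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
    }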

>> I gave some explanation on this issue in discussion during Jun 20 - 22 last
>> year.
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02426.html
>> on Jun 20
>> and
>> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02575.html
>> on Jun 21
>>
>>> btw does it mean that live migration can be still supported as long as
>>> device model proactively unmaps write-protected pages before starting
>>> live migration?
>>>
>> Yes.
> I'm not sure whether there'll be a sequence issue. I assume toolstack
> will first request entering logdirty mode, do iterative memory copy,
> and then stop VM including its virtual devices and then build final
> image (including vGPU state). XenGT supports live migration today.
> vGPU device model is notified for do state save/restore only in the
> last step of that flow (as part of Qemu save/restore). If your design
> requires vGPU device model to first unmaps write-protected pages
> (which means incapable of serving more request from guest which
> is equivalently to stop vGPU) before toolstack enters logdirty mode,
> I'm worried the required changes to the whole live migration flow...

Well, George previously wrote a draft patch to solve this issue and make
the lazy p2m change code more generic (with more information added to the
p2m structure). We believed it might also lift the live migration
restriction (with no changes to the interface between the hypervisor and
the device model). But there are some bugs, and neither of us has had
enough time to debug them. So I'd like to submit our current code first,
and once that solution matures, we can remove the live migration
restriction. :-)

Thanks
Yu

> Thanks
> Kevin

Patch

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 746799f..102c6c2 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -949,6 +949,14 @@  int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
 
     spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
 
+    if ( rc == 0 && flags == 0 )
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+    }
+
     return rc;
 }
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index a57b385..6ec950a 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -187,6 +187,15 @@  out:
  */
 static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
 {
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    /*
+     * Refuse to turn on global log-dirty mode if
+     * there's outstanding p2m_ioreq_server pages.
+     */
+    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
+        return -EBUSY;
+
     /* turn on PG_log_dirty bit in paging mode */
     paging_lock(d);
     d->arch.paging.mode |= PG_log_dirty;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index cc1eb21..1df3d09 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -544,6 +544,12 @@  static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
                     e.ipat = ipat;
                     if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
                     {
+                         if ( e.sa_p2mt == p2m_ioreq_server )
+                         {
+                             p2m->ioreq.entry_count--;
+                             ASSERT(p2m->ioreq.entry_count >= 0);
+                         }
+
                          e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
                                      ? p2m_ram_logdirty : p2m_ram_rw;
                          ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
@@ -965,7 +971,7 @@  static mfn_t ept_get_entry(struct p2m_domain *p2m,
     if ( is_epte_valid(ept_entry) )
     {
         if ( (recalc || ept_entry->recalc) &&
-             p2m_is_changeable(ept_entry->sa_p2mt) )
+             p2m_check_changeable(ept_entry->sa_p2mt) )
             *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                       : p2m_ram_rw;
         else
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index f6c45ec..169de75 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -436,11 +436,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
          needs_recalc(l1, *pent) )
     {
         l1_pgentry_t e = *pent;
+        p2m_type_t p2mt_old;
 
         if ( !valid_recalc(l1, e) )
             P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
                       p2m->domain->domain_id, gfn, level);
-        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
+        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
+        if ( p2m_is_changeable(p2mt_old) )
         {
             unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
@@ -460,6 +462,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
                      mfn &= ~(_PAGE_PSE_PAT >> PAGE_SHIFT);
                 flags |= _PAGE_PSE;
             }
+
+            if ( p2mt_old == p2m_ioreq_server )
+            {
+                p2m->ioreq.entry_count--;
+                ASSERT(p2m->ioreq.entry_count >= 0);
+            }
+
             e = l1e_from_pfn(mfn, flags);
             p2m_add_iommu_flags(&e, level,
                                 (p2mt == p2m_ram_rw)
@@ -729,7 +738,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
                                      struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !recalc || !p2m_is_changeable(t) )
+    if ( !recalc || !p2m_check_changeable(t) )
         return t;
     return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                 : p2m_ram_rw;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index dd4e477..e3e54f1 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -954,6 +954,26 @@  int p2m_change_type_one(struct domain *d, unsigned long gfn,
                          p2m->default_access)
          : -EBUSY;
 
+    if ( !rc )
+    {
+        switch ( nt )
+        {
+        case p2m_ram_rw:
+            if ( ot == p2m_ioreq_server )
+            {
+                p2m->ioreq.entry_count--;
+                ASSERT(p2m->ioreq.entry_count >= 0);
+            }
+            break;
+        case p2m_ioreq_server:
+            ASSERT(ot == p2m_ram_rw);
+            p2m->ioreq.entry_count++;
+            break;
+        default:
+            break;
+        }
+    }
+
     gfn_unlock(p2m, gfn, 0);
 
     return rc;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 3786680..395f125 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -120,7 +120,10 @@  typedef unsigned int p2m_query_t;
 
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
-                              | p2m_to_mask(p2m_ram_logdirty) )
+                              | p2m_to_mask(p2m_ram_logdirty) \
+                              | p2m_to_mask(p2m_ioreq_server) )
+
+#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
 
 #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
 
@@ -157,6 +160,7 @@  typedef unsigned int p2m_query_t;
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
 #define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
+#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
 /* Grant types are *not* considered valid, because they can be
@@ -178,6 +182,8 @@  typedef unsigned int p2m_query_t;
 
 #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) & P2M_INVALID_MFN_TYPES)
 
+#define p2m_check_changeable(t) (p2m_is_changeable(t) && !p2m_is_ioreq(t))
+
 typedef enum {
     p2m_host,
     p2m_nested,
@@ -349,6 +355,7 @@  struct p2m_domain {
           * are to be emulated by an ioreq server.
           */
          unsigned int flags;
+         long entry_count;
      } ioreq;
 };