
[v12,5/6] x86/ioreq server: Asynchronously reset outstanding p2m_ioreq_server entries.

Message ID 1491494017-30743-6-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang April 6, 2017, 3:53 p.m. UTC
After an ioreq server has been unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously, using the current p2m_change_entry_type_global()
interface.

A new field, entry_count, is introduced in struct p2m_domain to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only ever point to 4K page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We therefore do not need to worry about counting
2M/1G pages.

This patch disallows mapping of an ioreq server while p2m_ioreq_server
entries remain, in case another mapping occurs right after the current
one is unmapped and releases its lock, before the p2m table has been
synced.

This patch also disallows live migration while p2m_ioreq_server entries
remain in the p2m table. The core reason is that our current
implementation of p2m_change_entry_type_global() lacks the information
needed to resync p2m_ioreq_server entries correctly if global_logdirty
is on.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

changes in v7:
  - According to comments from George: add code to increase the
    entry_count.
  - Comment changes in {ept,p2m_pt}_set_entry.

changes in v6:
  - According to comments from Jan & George: move the count from
    p2m_change_type_one() to {ept,p2m_pt}_set_entry.
  - According to comments from George: comments change.

changes in v5:
  - According to comments from Jan: use unsigned long for entry_count;
  - According to comments from Jan: refuse mapping requests while
    p2m_ioreq_server entries remain in the p2m table.
  - Added "Reviewed-by: Paul Durrant <paul.durrant@citrix.com>"

changes in v4:
  - According to comments from Jan: use ASSERT() instead of 'if'
    condition in p2m_change_type_one().
  - According to comments from Jan: commit message changes to mention
    the p2m_ioreq_server are all based on 4K sized pages.

changes in v3:
  - Move the synchronous reset logic into patch 5.
  - According to comments from Jan: introduce p2m_check_changeable()
    to clarify the p2m type change code.
  - According to comments from George: use locks in the same order
    to avoid deadlock, call p2m_change_entry_type_global() after unmap
    of the ioreq server is finished.

changes in v2:
  - Move the calculation of the ioreq server page entry_count into
    p2m_change_type_one() so that we do not need a separate lock.
    Note: entry_count is also calculated in resolve_misconfig()/
    do_recalc(); fortunately, callers of both routines already hold
    the p2m lock.
  - Simplify logic in hvmop_set_mem_type().
  - Introduce routine p2m_finish_type_change() to walk the p2m
    table and do the p2m reset.
---
 xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
 xen/arch/x86/mm/hap/hap.c |  9 +++++++++
 xen/arch/x86/mm/p2m-ept.c | 24 +++++++++++++++++++++++-
 xen/arch/x86/mm/p2m-pt.c  | 30 ++++++++++++++++++++++++++++--
 xen/arch/x86/mm/p2m.c     |  9 +++++++++
 xen/include/asm-x86/p2m.h |  9 ++++++++-
 6 files changed, 85 insertions(+), 4 deletions(-)
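The bookkeeping described in the commit message can be illustrated with a small, self-contained model. All names below (`struct p2m_model`, `model_set_entry`, `model_enable_global_log_dirty`) are hypothetical stand-ins for illustration only; the real logic lives in {ept,p2m_pt}_set_entry, resolve_misconfig()/do_recalc() and hap_enable_log_dirty().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the entry_count bookkeeping; these are illustrative
 * stand-ins, not the actual Xen interfaces. */
typedef enum { p2m_ram_rw, p2m_ram_logdirty, p2m_ioreq_server } p2mt_t;

struct p2m_model {
    unsigned long entry_count;  /* outstanding p2m_ioreq_server 4K entries */
    p2mt_t entry[16];           /* pretend p2m table of 4K entries */
};

/* Mirrors the set_entry accounting: the count goes up when a new
 * p2m_ioreq_server entry is written, and down when one is replaced. */
static void model_set_entry(struct p2m_model *p2m, unsigned int i, p2mt_t nt)
{
    if ( nt == p2m_ioreq_server )
        p2m->entry_count++;
    if ( p2m->entry[i] == p2m_ioreq_server )
    {
        assert(p2m->entry_count > 0);
        p2m->entry_count--;
    }
    p2m->entry[i] = nt;
}

/* Mirrors hap_enable_log_dirty(): refuse global log-dirty mode (and
 * hence live migration) while any p2m_ioreq_server entries remain. */
static bool model_enable_global_log_dirty(const struct p2m_model *p2m)
{
    return p2m->entry_count == 0;  /* the real code returns -EBUSY otherwise */
}
```

In this model, the asynchronous p2m_change_entry_type_global() reset corresponds to rewriting every remaining p2m_ioreq_server entry as p2m_ram_rw, after which global log-dirty mode can be enabled again.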

Comments

George Dunlap April 6, 2017, 4:45 p.m. UTC | #1
On 06/04/17 16:53, Yu Zhang wrote:
> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> entries need to be reset back to p2m_ram_rw. This patch does this
> asynchronously with the current p2m_change_entry_type_global()
> interface.
> 
> New field entry_count is introduced in struct p2m_domain, to record
> the number of p2m_ioreq_server p2m page table entries. One nature of
> these entries is that they only point to 4K sized page frames, because
> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
> p2m_change_type_one(). We do not need to worry about the counting for
> 2M/1G sized pages.
> 
> This patch disallows mapping of an ioreq server, when there's still
> p2m_ioreq_server entry left, in case another mapping occurs right after
> the current one being unmapped, releases its lock, with p2m table not
> synced yet.
> 
> This patch also disallows live migration, when there's remaining
> p2m_ioreq_server entry in p2m table. The core reason is our current
> implementation of p2m_change_entry_type_global() lacks information
> to resync p2m_ioreq_server entries correctly if global_logdirty is
> on.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> 
> changes in v7:
>   - According to comments from George: add code to increase the
>     entry_count.
>   - Comment changes in {ept,p2m_pt}_set_entry.
> 
> changes in v6:
>   - According to comments from Jan & George: move the count from
>     p2m_change_type_one() to {ept,p2m_pt}_set_entry.
>   - According to comments from George: comments change.
> 
> changes in v5:
>   - According to comments from Jan: use unsigned long for entry_count;
>   - According to comments from Jan: refuse mapping requirement when
>     there's p2m_ioreq_server remained in p2m table.
>   - Added "Reviewed-by: Paul Durrant <paul.durrant@citrix.com>"
> 
> changes in v4:
>   - According to comments from Jan: use ASSERT() instead of 'if'
>     condition in p2m_change_type_one().
>   - According to comments from Jan: commit message changes to mention
>     the p2m_ioreq_server are all based on 4K sized pages.
> 
> changes in v3:
>   - Move the synchronously resetting logic into patch 5.
>   - According to comments from Jan: introduce p2m_check_changeable()
>     to clarify the p2m type change code.
>   - According to comments from George: use locks in the same order
>     to avoid deadlock, call p2m_change_entry_type_global() after unmap
>     of the ioreq server is finished.
> 
> changes in v2:
>   - Move the calculation of ioreq server page entry_cout into
>     p2m_change_type_one() so that we do not need a seperate lock.
>     Note: entry_count is also calculated in resolve_misconfig()/
>     do_recalc(), fortunately callers of both routines have p2m
>     lock protected already.
>   - Simplify logic in hvmop_set_mem_type().
>   - Introduce routine p2m_finish_type_change() to walk the p2m
>     table and do the p2m reset.
> ---
>  xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
>  xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>  xen/arch/x86/mm/p2m-ept.c | 24 +++++++++++++++++++++++-
>  xen/arch/x86/mm/p2m-pt.c  | 30 ++++++++++++++++++++++++++++--
>  xen/arch/x86/mm/p2m.c     |  9 +++++++++
>  xen/include/asm-x86/p2m.h |  9 ++++++++-
>  6 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index 5bf3b6d..07a6c26 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -955,6 +955,14 @@ int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
>  
>      spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
>  
> +    if ( rc == 0 && flags == 0 )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +        if ( read_atomic(&p2m->ioreq.entry_count) )
> +            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
> +    }
> +
>      return rc;
>  }
>  
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index a57b385..4b591fe 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -187,6 +187,15 @@ out:
>   */
>  static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
>  {
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    /*
> +     * Refuse to turn on global log-dirty mode if
> +     * there are outstanding p2m_ioreq_server pages.
> +     */
> +    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
> +        return -EBUSY;
> +
>      /* turn on PG_log_dirty bit in paging mode */
>      paging_lock(d);
>      d->arch.paging.mode |= PG_log_dirty;
> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index cc1eb21..ecd5ceb 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>                      e.ipat = ipat;
>                      if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>                      {
> +                         if ( e.sa_p2mt == p2m_ioreq_server )
> +                         {
> +                             ASSERT(p2m->ioreq.entry_count > 0);
> +                             p2m->ioreq.entry_count--;
> +                         }
> +
>                           e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>                                       ? p2m_ram_logdirty : p2m_ram_rw;
>                           ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>          new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>                                      old_entry.suppress_ve : 1;
>  
> +    /*
> +     * p2m_ioreq_server is only used for 4K pages, so the
> +     * count shall only happen on ept page table entries.

Actually, I liked the previous comment ("We only do x...") better. :-)

FYI normally if you're trying to avoid saying "we" you use the passive
voice, like this: "...so the count *is only done* on ept page table
entries".

But I don't think that's a big deal at this point:

Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tian, Kevin April 7, 2017, 7:34 a.m. UTC | #2
> From: Yu Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: Thursday, April 6, 2017 11:54 PM
> 
> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> entries need to be reset back to p2m_ram_rw. This patch does this
> asynchronously with the current p2m_change_entry_type_global()
> interface.
> 
> New field entry_count is introduced in struct p2m_domain, to record
> the number of p2m_ioreq_server p2m page table entries. One nature of
> these entries is that they only point to 4K sized page frames, because
> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
> p2m_change_type_one(). We do not need to worry about the counting for
> 2M/1G sized pages.
> 
> This patch disallows mapping of an ioreq server, when there's still
> p2m_ioreq_server entry left, in case another mapping occurs right after
> the current one being unmapped, releases its lock, with p2m table not
> synced yet.
> 
> This patch also disallows live migration, when there's remaining
> p2m_ioreq_server entry in p2m table. The core reason is our current
> implementation of p2m_change_entry_type_global() lacks information
> to resync p2m_ioreq_server entries correctly if global_logdirty is
> on.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich April 7, 2017, 9:40 a.m. UTC | #3
>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>                      e.ipat = ipat;
>                      if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>                      {
> +                         if ( e.sa_p2mt == p2m_ioreq_server )
> +                         {
> +                             ASSERT(p2m->ioreq.entry_count > 0);
> +                             p2m->ioreq.entry_count--;
> +                         }
> +
>                           e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>                                       ? p2m_ram_logdirty : p2m_ram_rw;

I don't think this can be right: Why would it be valid to change the
type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
here, without taking into account further information? This code
can run at any time, not just when you want to reset things. So at
the very least there is a check missing whether a suitable ioreq
server still exists (and only if it doesn't you want to do the type
reset).

> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>          new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>                                      old_entry.suppress_ve : 1;
>  
> +    /*
> +     * p2m_ioreq_server is only used for 4K pages, so the
> +     * count shall only happen on ept page table entries.
> +     */
> +    if ( p2mt == p2m_ioreq_server )
> +    {
> +        ASSERT(i == 0);
> +        p2m->ioreq.entry_count++;
> +    }
> +
> +    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
> +    {
> +        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);

I think this would better be two ASSERT()s, so if one triggers it's
clear what problem it was right away. The two conditions aren't
really related to one another.
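Applied to the hunk quoted above, the suggestion would look something like this sketch (a standalone model with assert() standing in for Xen's ASSERT(); the helper name is hypothetical):

```c
#include <assert.h>

#define ASSERT assert  /* stand-in for Xen's ASSERT() in this sketch */

/* Decrement the count for an old p2m_ioreq_server entry, with the two
 * unrelated invariants checked separately so a failure report shows
 * exactly which one broke. */
static void drop_old_ioreq_entry(unsigned long *entry_count, int i)
{
    ASSERT(*entry_count > 0);  /* accounting invariant */
    ASSERT(i == 0);            /* 4K-only (level-0) invariant */
    (*entry_count)--;
}
```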

> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>      if ( is_epte_valid(ept_entry) )
>      {
>          if ( (recalc || ept_entry->recalc) &&
> -             p2m_is_changeable(ept_entry->sa_p2mt) )
> +             p2m_check_changeable(ept_entry->sa_p2mt) )

I think the distinction between these two is rather arbitrary, and I
also think this is part of the problem above: Distinguishing log-dirty
from ram-rw requires auxiliary data to be consulted. The same
ought to apply to ioreq-server, and then there wouldn't be a need
to have two p2m_*_changeable() flavors.

Of course the subsequent use of p2m_is_logdirty_range() may then
need amending.

In the end it looks like you have the inverse problem here compared
to above: You should return ram-rw when the reset was already
initiated. At least that's how I would see the logic to match up with
the log-dirty handling (where the _effective_ rather than the last
stored type is being returned).
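For reference, the distinction being questioned here can be sketched as follows (simplified predicates inferred from the patch's naming, not the exact macros in xen/include/asm-x86/p2m.h):

```c
#include <stdbool.h>

typedef enum {
    p2m_ram_rw, p2m_ram_logdirty, p2m_ioreq_server, p2m_mmio_dm
} p2mt_t;

/* Types whose entries may be subject to a lazy, recalc-driven type change. */
static bool p2m_is_changeable(p2mt_t t)
{
    return t == p2m_ram_rw || t == p2m_ram_logdirty || t == p2m_ioreq_server;
}

/* The stricter flavor excludes p2m_ioreq_server, for paths where the
 * log-dirty range information alone determines the effective type. */
static bool p2m_check_changeable(p2mt_t t)
{
    return p2m_is_changeable(t) && t != p2m_ioreq_server;
}
```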

> @@ -606,6 +615,8 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>  
>      if ( page_order == PAGE_ORDER_4K )
>      {
> +        p2m_type_t p2mt_old;
> +
>          rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
>                              L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>                              L2_PAGETABLE_ENTRIES, PGT_l1_page_table, 1);
> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>          if ( entry_content.l1 != 0 )
>              p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>  
> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
> +
> +        /*
> +         * p2m_ioreq_server is only used for 4K pages, so
> +         * the count shall only be performed for level 1 entries.
> +         */
> +        if ( p2mt == p2m_ioreq_server )
> +            p2m->ioreq.entry_count++;
> +
> +        if ( p2mt_old == p2m_ioreq_server )
> +        {
> +            ASSERT(p2m->ioreq.entry_count > 0);
> +            p2m->ioreq.entry_count--;
> +        }
> +
>          /* level 1 entry */
>          p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);

I think to match up with EPT you also want to add

    ASSERT(p2mt_old != p2m_ioreq_server);

to the 2M and 1G paths.

Jan
Yu Zhang April 7, 2017, 9:53 a.m. UTC | #4
On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>>                       e.ipat = ipat;
>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>                       {
>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>> +                         {
>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>> +                             p2m->ioreq.entry_count--;
>> +                         }
>> +
>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
> I don't think this can be right: Why would it be valid to change the
> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
> here, without taking into account further information? This code
> can run at any time, not just when you want to reset things. So at
> the very least there is a check missing whether a suitable ioreq
> server still exists (and only if it doesn't you want to do the type
> reset).

Sorry, Jan. I think we discussed this quite long ago.
Indeed, information is lacking here, and that's why global_logdirty
is disallowed while there are remaining p2m_ioreq_server entries. :-)

>
>> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>>           new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>>                                       old_entry.suppress_ve : 1;
>>   
>> +    /*
>> +     * p2m_ioreq_server is only used for 4K pages, so the
>> +     * count shall only happen on ept page table entries.
>> +     */
>> +    if ( p2mt == p2m_ioreq_server )
>> +    {
>> +        ASSERT(i == 0);
>> +        p2m->ioreq.entry_count++;
>> +    }
>> +
>> +    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
>> +    {
>> +        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);
> I think this would better be two ASSERT()s, so if one triggers it's
> clear what problem it was right away. The two conditions aren't
> really related to one another.
>
>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>       if ( is_epte_valid(ept_entry) )
>>       {
>>           if ( (recalc || ept_entry->recalc) &&
>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
> I think the distinction between these two is rather arbitrary, and I
> also think this is part of the problem above: Distinguishing log-dirty
> from ram-rw requires auxiliary data to be consulted. The same
> ought to apply to ioreq-server, and then there wouldn't be a need
> to have two p2m_*_changeable() flavors.

Well, I think we also discussed this quite long ago; here is the link:
https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html

> Of course the subsequent use p2m_is_logdirty_range() may then
> need amending.
>
> In the end it looks like you have the inverse problem here compared
> to above: You should return ram-rw when the reset was already
> initiated. At least that's how I would see the logic to match up with
> the log-dirty handling (where the _effective_ rather than the last
> stored type is being returned).
>
>> @@ -606,6 +615,8 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>>   
>>       if ( page_order == PAGE_ORDER_4K )
>>       {
>> +        p2m_type_t p2mt_old;
>> +
>>           rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
>>                               L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>>                               L2_PAGETABLE_ENTRIES, PGT_l1_page_table, 1);
>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>>           if ( entry_content.l1 != 0 )
>>               p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>   
>> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>> +
>> +        /*
>> +         * p2m_ioreq_server is only used for 4K pages, so
>> +         * the count shall only be performed for level 1 entries.
>> +         */
>> +        if ( p2mt == p2m_ioreq_server )
>> +            p2m->ioreq.entry_count++;
>> +
>> +        if ( p2mt_old == p2m_ioreq_server )
>> +        {
>> +            ASSERT(p2m->ioreq.entry_count > 0);
>> +            p2m->ioreq.entry_count--;
>> +        }
>> +
>>           /* level 1 entry */
>>           p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
> I think to match up with EPT you also want to add
>
>      ASSERT(p2mt_old != p2m_ioreq_server);
>
> to the 2M and 1G paths.

Is this really necessary? The 2M and 1G paths do not have p2mt_old;
defining one and peeking at the p2m type just to have an ASSERT does
not seem very useful - and will hurt performance.

As for EPT, since there's already a variable 'i', which may be greater
than 0, I added an ASSERT there.

Yu
> Jan
>
Yu Zhang April 7, 2017, 10:14 a.m. UTC | #5
On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>>                       e.ipat = ipat;
>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>                       {
>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>> +                         {
>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>> +                             p2m->ioreq.entry_count--;
>> +                         }
>> +
>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
> I don't think this can be right: Why would it be valid to change the
> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
> here, without taking into account further information? This code
> can run at any time, not just when you want to reset things. So at
> the very least there is a check missing whether a suitable ioreq
> server still exists (and only if it doesn't you want to do the type
> reset).

Also, I do not think we need to check whether a suitable ioreq server
still exists. We have guaranteed in our patch that no new ioreq server
will be mapped as long as the p2m table is not clean. :)

Yu

[snip]
George Dunlap April 7, 2017, 10:22 a.m. UTC | #6
On 07/04/17 10:53, Yu Zhang wrote:
> 
> 
> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>> *p2m, unsigned long gfn)
>>>                       e.ipat = ipat;
>>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>                       {
>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>> +                         {
>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>> +                             p2m->ioreq.entry_count--;
>>> +                         }
>>> +
>>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>> + i, gfn + i)
>>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
>> I don't think this can be right: Why would it be valid to change the
>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>> here, without taking into account further information? This code
>> can run at any time, not just when you want to reset things. So at
>> the very least there is a check missing whether a suitable ioreq
>> server still exists (and only if it doesn't you want to do the type
>> reset).
> 
> Sorry, Jan. I think we have discussed this quite long ago.
> Indeed, there's information lacked here, and that's why global_logdirty
> is disallowed
> when there's remaining p2m_ioreq_server entries. :-)
> 
>>
>>> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned
>>> long gfn, mfn_t mfn,
>>>           new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>>>                                       old_entry.suppress_ve : 1;
>>>   +    /*
>>> +     * p2m_ioreq_server is only used for 4K pages, so the
>>> +     * count shall only happen on ept page table entries.
>>> +     */
>>> +    if ( p2mt == p2m_ioreq_server )
>>> +    {
>>> +        ASSERT(i == 0);
>>> +        p2m->ioreq.entry_count++;
>>> +    }
>>> +
>>> +    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
>>> +    {
>>> +        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);
>> I think this would better be two ASSERT()s, so if one triggers it's
>> clear what problem it was right away. The two conditions aren't
>> really related to one another.
>>
>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>       if ( is_epte_valid(ept_entry) )
>>>       {
>>>           if ( (recalc || ept_entry->recalc) &&
>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>> I think the distinction between these two is rather arbitrary, and I
>> also think this is part of the problem above: Distinguishing log-dirty
>> from ram-rw requires auxiliary data to be consulted. The same
>> ought to apply to ioreq-server, and then there wouldn't be a need
>> to have two p2m_*_changeable() flavors.
> 
> Well, I think we have also discussed this quite long ago, here is the link.
> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html
> 
>> Of course the subsequent use p2m_is_logdirty_range() may then
>> need amending.
>>
>> In the end it looks like you have the inverse problem here compared
>> to above: You should return ram-rw when the reset was already
>> initiated. At least that's how I would see the logic to match up with
>> the log-dirty handling (where the _effective_ rather than the last
>> stored type is being returned).
>>
>>> @@ -606,6 +615,8 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned
>>> long gfn, mfn_t mfn,
>>>         if ( page_order == PAGE_ORDER_4K )
>>>       {
>>> +        p2m_type_t p2mt_old;
>>> +
>>>           rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
>>>                               L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>>>                               L2_PAGETABLE_ENTRIES,
>>> PGT_l1_page_table, 1);
>>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
>>> unsigned long gfn, mfn_t mfn,
>>>           if ( entry_content.l1 != 0 )
>>>               p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>>   +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>>> +
>>> +        /*
>>> +         * p2m_ioreq_server is only used for 4K pages, so
>>> +         * the count shall only be performed for level 1 entries.
>>> +         */
>>> +        if ( p2mt == p2m_ioreq_server )
>>> +            p2m->ioreq.entry_count++;
>>> +
>>> +        if ( p2mt_old == p2m_ioreq_server )
>>> +        {
>>> +            ASSERT(p2m->ioreq.entry_count > 0);
>>> +            p2m->ioreq.entry_count--;
>>> +        }
>>> +
>>>           /* level 1 entry */
>>>           p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
>> I think to match up with EPT you also want to add
>>
>>      ASSERT(p2mt_old != p2m_ioreq_server);
>>
>> to the 2M and 1G paths.
> 
> Is this really necessary? 2M and 1G page does not have p2mt_old,
> defining one and peek the p2m type just
> to have an ASSERT does not seem quite useful - and will hurt the
> performance.
> 
> As to ept, since there's already a variable 'i', which may be greater
> than 0 - so I added an ASSERT.

Yes, that's Jan's point -- that for EPT, there is effectively an
ASSERT() that 2M and 1G entries are not p2m_ioreq_server; but for SVM,
because of the code duplication, there is not.

ASSERT()s are:
1. There to double-check that the assumptions you're making (i.e., "2M
and 1G entries can never be of type p2m_ioreq_server") are valid
2. Only enabled when debug=y, and so are generally not a performance
consideration.

You're making an assumption, so an ASSERT is useful; and it's only a
one-line check that will be removed for non-debug builds, so the
performance is not a consideration.

 -George
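George's second point can be demonstrated with a standalone sketch. XEN_DEBUG below is a hypothetical stand-in for Xen's debug=y build setting, and the helper function is invented for illustration:

```c
#include <stdio.h>

/* Sketch of an ASSERT() that vanishes in release builds, as Xen's does
 * when built with debug=n. XEN_DEBUG is a hypothetical stand-in for
 * that build setting. */
#ifdef XEN_DEBUG
#define ASSERT(p) \
    do { if ( !(p) ) fprintf(stderr, "Assertion '%s' failed\n", #p); } while ( 0 )
#else
#define ASSERT(p) ((void)0)  /* compiles to nothing: zero runtime cost */
#endif

/* The 2M/1G-path check under discussion: a one-line assumption check
 * in debug builds, and free in release builds. */
static int check_large_page_type(int p2mt_old, int p2m_ioreq_server_type)
{
    ASSERT(p2mt_old != p2m_ioreq_server_type);
    return p2mt_old;
}
```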
Yu Zhang April 7, 2017, 10:22 a.m. UTC | #7
On 4/7/2017 6:22 PM, George Dunlap wrote:
> On 07/04/17 10:53, Yu Zhang wrote:
>>
>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>>> *p2m, unsigned long gfn)
>>>>                        e.ipat = ipat;
>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>                        {
>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>> +                         {
>>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>>> +                             p2m->ioreq.entry_count--;
>>>> +                         }
>>>> +
>>>>                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>>> + i, gfn + i)
>>>>                                         ? p2m_ram_logdirty : p2m_ram_rw;
>>> I don't think this can be right: Why would it be valid to change the
>>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>>> here, without taking into account further information? This code
>>> can run at any time, not just when you want to reset things. So at
>>> the very least there is a check missing whether a suitable ioreq
>>> server still exists (and only if it doesn't you want to do the type
>>> reset).
>> Sorry, Jan. I think we have discussed this quite long ago.
>> Indeed, there's information lacked here, and that's why global_logdirty
>> is disallowed
>> when there's remaining p2m_ioreq_server entries. :-)
>>
>>>> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned
>>>> long gfn, mfn_t mfn,
>>>>            new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>>>>                                        old_entry.suppress_ve : 1;
>>>>    +    /*
>>>> +     * p2m_ioreq_server is only used for 4K pages, so the
>>>> +     * count shall only happen on ept page table entries.
>>>> +     */
>>>> +    if ( p2mt == p2m_ioreq_server )
>>>> +    {
>>>> +        ASSERT(i == 0);
>>>> +        p2m->ioreq.entry_count++;
>>>> +    }
>>>> +
>>>> +    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
>>>> +    {
>>>> +        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);
>>> I think this would better be two ASSERT()s, so if one triggers it's
>>> clear what problem it was right away. The two conditions aren't
>>> really related to one another.
>>>
>>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>>        if ( is_epte_valid(ept_entry) )
>>>>        {
>>>>            if ( (recalc || ept_entry->recalc) &&
>>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>>> I think the distinction between these two is rather arbitrary, and I
>>> also think this is part of the problem above: Distinguishing log-dirty
>>> from ram-rw requires auxiliary data to be consulted. The same
>>> ought to apply to ioreq-server, and then there wouldn't be a need
>>> to have two p2m_*_changeable() flavors.
>> Well, I think we have also discussed this quite long ago, here is the link.
>> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html
>>
>>> Of course the subsequent use p2m_is_logdirty_range() may then
>>> need amending.
>>>
>>> In the end it looks like you have the inverse problem here compared
>>> to above: You should return ram-rw when the reset was already
>>> initiated. At least that's how I would see the logic to match up with
>>> the log-dirty handling (where the _effective_ rather than the last
>>> stored type is being returned).
>>>
>>>> @@ -606,6 +615,8 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned
>>>> long gfn, mfn_t mfn,
>>>>          if ( page_order == PAGE_ORDER_4K )
>>>>        {
>>>> +        p2m_type_t p2mt_old;
>>>> +
>>>>            rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
>>>>                                L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>>>>                                L2_PAGETABLE_ENTRIES,
>>>> PGT_l1_page_table, 1);
>>>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
>>>> unsigned long gfn, mfn_t mfn,
>>>>            if ( entry_content.l1 != 0 )
>>>>                p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>>>    +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>>>> +
>>>> +        /*
>>>> +         * p2m_ioreq_server is only used for 4K pages, so
>>>> +         * the count shall only be performed for level 1 entries.
>>>> +         */
>>>> +        if ( p2mt == p2m_ioreq_server )
>>>> +            p2m->ioreq.entry_count++;
>>>> +
>>>> +        if ( p2mt_old == p2m_ioreq_server )
>>>> +        {
>>>> +            ASSERT(p2m->ioreq.entry_count > 0);
>>>> +            p2m->ioreq.entry_count--;
>>>> +        }
>>>> +
>>>>            /* level 1 entry */
>>>>            p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
>>> I think to match up with EPT you also want to add
>>>
>>>       ASSERT(p2mt_old != p2m_ioreq_server);
>>>
>>> to the 2M and 1G paths.
>> Is this really necessary? 2M and 1G page does not have p2mt_old,
>> defining one and peek the p2m type just
>> to have an ASSERT does not seem quite useful - and will hurt the
>> performance.
>>
>> As to ept, since there's already a variable 'i', which may be greater
>> than 0 - so I added an ASSERT.
> Yes, that's Jan's point -- that for EPT, there is effectively ASSERT()
> that 2M and 1G entries are not p2m_ioreq_server; but for SVM, because of
> the code duplication, there is not.
>
> ASSERT()s are:
> 1. There to double-check that the assumptions you're making (i.e., "2M
> and 1G entries can never be of type p2m_ioreq_server") are valid
> 2. Only enabled when debug=y, and so are generally not a performance
> consideration.
>
> You're making an assumption, so an ASSERT is useful; and it's only a
> one-line check that will be removed for non-debug builds, so the
> performance is not a consideration.

Thanks George.
I am not worried about the cost of the ASSERT() itself, but about the effort 
of peeking at the p2m entry:
p2m_flags_to_type(l1e_get_flags(*p2m_entry));
And that cannot be removed at runtime.

Yu

>   -George
>
>
Jan Beulich April 7, 2017, 10:26 a.m. UTC | #8
>>> On 07.04.17 at 11:53, <yu.c.zhang@linux.intel.com> wrote:
> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>>>                       e.ipat = ipat;
>>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>                       {
>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>> +                         {
>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>> +                             p2m->ioreq.entry_count--;
>>> +                         }
>>> +
>>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
>> I don't think this can be right: Why would it be valid to change the
>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>> here, without taking into account further information? This code
>> can run at any time, not just when you want to reset things. So at
>> the very least there is a check missing whether a suitable ioreq
>> server still exists (and only if it doesn't you want to do the type
>> reset).
> 
> Sorry, Jan. I think we have discussed this quite long ago.
> Indeed, there's information lacked here, and that's why global_logdirty 
> is disallowed
> when there's remaining p2m_ioreq_server entries. :-)

log-dirty isn't the interesting part (which is why I did put it in
parentheses). You change ioreq-server to ram-rw here
regardless of whether we're actually cleaning up after a
detached server.

>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>       if ( is_epte_valid(ept_entry) )
>>>       {
>>>           if ( (recalc || ept_entry->recalc) &&
>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>> I think the distinction between these two is rather arbitrary, and I
>> also think this is part of the problem above: Distinguishing log-dirty
>> from ram-rw requires auxiliary data to be consulted. The same
>> ought to apply to ioreq-server, and then there wouldn't be a need
>> to have two p2m_*_changeable() flavors.
> 
> Well, I think we have also discussed this quite long ago, here is the link.
> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html 

That was a different discussion, namely not considering this ...

>> Of course the subsequent use p2m_is_logdirty_range() may then
>> need amending.
>>
>> In the end it looks like you have the inverse problem here compared
>> to above: You should return ram-rw when the reset was already
>> initiated. At least that's how I would see the logic to match up with
>> the log-dirty handling (where the _effective_ rather than the last
>> stored type is being returned).

... at all.

>>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>>>           if ( entry_content.l1 != 0 )
>>>               p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>>   
>>> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>>> +
>>> +        /*
>>> +         * p2m_ioreq_server is only used for 4K pages, so
>>> +         * the count shall only be performed for level 1 entries.
>>> +         */
>>> +        if ( p2mt == p2m_ioreq_server )
>>> +            p2m->ioreq.entry_count++;
>>> +
>>> +        if ( p2mt_old == p2m_ioreq_server )
>>> +        {
>>> +            ASSERT(p2m->ioreq.entry_count > 0);
>>> +            p2m->ioreq.entry_count--;
>>> +        }
>>> +
>>>           /* level 1 entry */
>>>           p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
>> I think to match up with EPT you also want to add
>>
>>      ASSERT(p2mt_old != p2m_ioreq_server);
>>
>> to the 2M and 1G paths.
> 
> Is this really necessary? 2M and 1G page does not have p2mt_old, 
> defining one and peek the p2m type just
> to have an ASSERT does not seem quite useful - and will hurt the 
> performance.

I don't follow. Matching up with the 4k case, what you'd add would
be

    ASSERT(p2m_flags_to_type(flags) != p2m_ioreq_server);

Jan
Jan Beulich April 7, 2017, 10:28 a.m. UTC | #9
>>> On 07.04.17 at 12:14, <yu.c.zhang@linux.intel.com> wrote:
> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>>>                       e.ipat = ipat;
>>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>                       {
>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>> +                         {
>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>> +                             p2m->ioreq.entry_count--;
>>> +                         }
>>> +
>>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
>> I don't think this can be right: Why would it be valid to change the
>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>> here, without taking into account further information? This code
>> can run at any time, not just when you want to reset things. So at
>> the very least there is a check missing whether a suitable ioreq
>> server still exists (and only if it doesn't you want to do the type
>> reset).
> 
> Also I do not think we need to check if a suitable ioreq server still 
> exists. We have guaranteed
> in our patch that no new ioreq server will be mapped as long as the p2m 
> table is not clean. :)

You didn't get my point then: The problem is when there still _is_
an ioreq server. You're changing the type even in that case.

Jan
George Dunlap April 7, 2017, 10:28 a.m. UTC | #10
On 07/04/17 11:14, Yu Zhang wrote:
> 
> 
> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>> *p2m, unsigned long gfn)
>>>                       e.ipat = ipat;
>>>                       if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>                       {
>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>> +                         {
>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>> +                             p2m->ioreq.entry_count--;
>>> +                         }
>>> +
>>>                            e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>> + i, gfn + i)
>>>                                        ? p2m_ram_logdirty : p2m_ram_rw;
>> I don't think this can be right: Why would it be valid to change the
>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>> here, without taking into account further information? This code
>> can run at any time, not just when you want to reset things. So at
>> the very least there is a check missing whether a suitable ioreq
>> server still exists (and only if it doesn't you want to do the type
>> reset).
> 
> Also I do not think we need to check if a suitable ioreq server still
> exists. We have guaranteed
> in our patch that no new ioreq server will be mapped as long as the p2m
> table is not clean. :)

Jan is saying that you should only change ioreq_server -> ram if there
is *not* an ioreq server; and that if this is called with an ioreq
server still active, then it must be some other change you're looking at.

The problem, though, is that misconfiguration happens in many
circumstances.  Grep for "memory_type_changed()" -- each of those
results in a recalculation of the entire p2m, which will (in the current
code) wipe out any ioreq_server entries.

 -George
Jan Beulich April 7, 2017, 10:41 a.m. UTC | #11
>>> On 07.04.17 at 12:22, <yu.c.zhang@linux.intel.com> wrote:

> 
> On 4/7/2017 6:22 PM, George Dunlap wrote:
>> On 07/04/17 10:53, Yu Zhang wrote:
>>>
>>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>>>> *p2m, unsigned long gfn)
>>>>>                        e.ipat = ipat;
>>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>>                        {
>>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>>> +                         {
>>>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>>>> +                             p2m->ioreq.entry_count--;
>>>>> +                         }
>>>>> +
>>>>>                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>>>> + i, gfn + i)
>>>>>                                         ? p2m_ram_logdirty : p2m_ram_rw;
>>>> I don't think this can be right: Why would it be valid to change the
>>>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>>>> here, without taking into account further information? This code
>>>> can run at any time, not just when you want to reset things. So at
>>>> the very least there is a check missing whether a suitable ioreq
>>>> server still exists (and only if it doesn't you want to do the type
>>>> reset).
>>> Sorry, Jan. I think we have discussed this quite long ago.
>>> Indeed, there's information lacked here, and that's why global_logdirty
>>> is disallowed
>>> when there's remaining p2m_ioreq_server entries. :-)
>>>
>>>>> @@ -816,6 +822,22 @@ ept_set_entry(struct p2m_domain *p2m, unsigned
>>>>> long gfn, mfn_t mfn,
>>>>>            new_entry.suppress_ve = is_epte_valid(&old_entry) ?
>>>>>                                        old_entry.suppress_ve : 1;
>>>>>    +    /*
>>>>> +     * p2m_ioreq_server is only used for 4K pages, so the
>>>>> +     * count shall only happen on ept page table entries.
>>>>> +     */
>>>>> +    if ( p2mt == p2m_ioreq_server )
>>>>> +    {
>>>>> +        ASSERT(i == 0);
>>>>> +        p2m->ioreq.entry_count++;
>>>>> +    }
>>>>> +
>>>>> +    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
>>>>> +    {
>>>>> +        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);
>>>> I think this would better be two ASSERT()s, so if one triggers it's
>>>> clear what problem it was right away. The two conditions aren't
>>>> really related to one another.
>>>>
>>>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>>>        if ( is_epte_valid(ept_entry) )
>>>>>        {
>>>>>            if ( (recalc || ept_entry->recalc) &&
>>>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>>>> I think the distinction between these two is rather arbitrary, and I
>>>> also think this is part of the problem above: Distinguishing log-dirty
>>>> from ram-rw requires auxiliary data to be consulted. The same
>>>> ought to apply to ioreq-server, and then there wouldn't be a need
>>>> to have two p2m_*_changeable() flavors.
>>> Well, I think we have also discussed this quite long ago, here is the link.
>>> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html 
>>>
>>>> Of course the subsequent use p2m_is_logdirty_range() may then
>>>> need amending.
>>>>
>>>> In the end it looks like you have the inverse problem here compared
>>>> to above: You should return ram-rw when the reset was already
>>>> initiated. At least that's how I would see the logic to match up with
>>>> the log-dirty handling (where the _effective_ rather than the last
>>>> stored type is being returned).
>>>>
>>>>> @@ -606,6 +615,8 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned
>>>>> long gfn, mfn_t mfn,
>>>>>          if ( page_order == PAGE_ORDER_4K )
>>>>>        {
>>>>> +        p2m_type_t p2mt_old;
>>>>> +
>>>>>            rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
>>>>>                                L2_PAGETABLE_SHIFT - PAGE_SHIFT,
>>>>>                                L2_PAGETABLE_ENTRIES,
>>>>> PGT_l1_page_table, 1);
>>>>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m,
>>>>> unsigned long gfn, mfn_t mfn,
>>>>>            if ( entry_content.l1 != 0 )
>>>>>                p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>>>>    +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>>>>> +
>>>>> +        /*
>>>>> +         * p2m_ioreq_server is only used for 4K pages, so
>>>>> +         * the count shall only be performed for level 1 entries.
>>>>> +         */
>>>>> +        if ( p2mt == p2m_ioreq_server )
>>>>> +            p2m->ioreq.entry_count++;
>>>>> +
>>>>> +        if ( p2mt_old == p2m_ioreq_server )
>>>>> +        {
>>>>> +            ASSERT(p2m->ioreq.entry_count > 0);
>>>>> +            p2m->ioreq.entry_count--;
>>>>> +        }
>>>>> +
>>>>>            /* level 1 entry */
>>>>>            p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
>>>> I think to match up with EPT you also want to add
>>>>
>>>>       ASSERT(p2mt_old != p2m_ioreq_server);
>>>>
>>>> to the 2M and 1G paths.
>>> Is this really necessary? 2M and 1G page does not have p2mt_old,
>>> defining one and peek the p2m type just
>>> to have an ASSERT does not seem quite useful - and will hurt the
>>> performance.
>>>
>>> As to ept, since there's already a variable 'i', which may be greater
>>> than 0 - so I added an ASSERT.
>> Yes, that's Jan's point -- that for EPT, there is effectively ASSERT()
>> that 2M and 1G entries are not p2m_ioreq_server; but for SVM, because of
>> the code duplication, there is not.
>>
>> ASSERT()s are:
>> 1. There to double-check that the assumptions you're making (i.e., "2M
>> and 1G entries can never be of type p2m_ioreq_server") are valid
>> 2. Only enabled when debug=y, and so are generally not a performance
>> consideration.
>>
>> You're making an assumption, so an ASSERT is useful; and it's only a
>> one-line check that will be removed for non-debug builds, so the
>> performance is not a consideration.
> 
> Thanks George.
> I do not worry about the cost of the ASSERT() itself, but the effort of 
> peeking a p2m:
> p2m_flags_to_type(l1e_get_flags(*p2m_entry));
> And this cannot be  removed during runtime.

The l1e_get_flags() is not needed - both paths latch that into
"flags" already. And p2m_flags_to_type() is a simple inline
function, for which the compiler should be able to see that its
result is not used, and hence all code generated from it can be
deleted.

Jan
Yu Zhang April 7, 2017, 10:50 a.m. UTC | #12
On 4/7/2017 6:28 PM, George Dunlap wrote:
> On 07/04/17 11:14, Yu Zhang wrote:
>>
>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>>> *p2m, unsigned long gfn)
>>>>                        e.ipat = ipat;
>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>                        {
>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>> +                         {
>>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>>> +                             p2m->ioreq.entry_count--;
>>>> +                         }
>>>> +
>>>>                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>>> + i, gfn + i)
>>>>                                         ? p2m_ram_logdirty : p2m_ram_rw;
>>> I don't think this can be right: Why would it be valid to change the
>>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>>> here, without taking into account further information? This code
>>> can run at any time, not just when you want to reset things. So at
>>> the very least there is a check missing whether a suitable ioreq
>>> server still exists (and only if it doesn't you want to do the type
>>> reset).
>> Also I do not think we need to check if a suitable ioreq server still
>> exists. We have guaranteed
>> in our patch that no new ioreq server will be mapped as long as the p2m
>> table is not clean. :)
> Jan is saying that you should only change ioreq_server -> ram if there
> is *not* an ioreq server; and that if this is called with an ioreq
> server still active, then it must be some other change you're looking at.
>
> The problem, though, is that misconfiguration happens in many
> circumstances.  Grep for "memory_type_changed()" -- each of those
> results in a recalculation of the entire p2m, which will (in the current
> code) wipe out any ioreq_server entries.

Well, I'm not aware of any actions besides log-dirty that would cause 
the reset.
But if that can happen, would the change below solve it?

@@ -546,12 +546,16 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
                      {
                           if ( e.sa_p2mt == p2m_ioreq_server )
                           {
-                             ASSERT(p2m->ioreq.entry_count > 0);
-                             p2m->ioreq.entry_count--;
+                             if ( p2m->ioreq.server == NULL )
+                             {
+                                 ASSERT(p2m->ioreq.entry_count > 0);
+                                 p2m->ioreq.entry_count--;
+                                 e.sa_p2mt = p2m_ram_rw;
+                             }
                           }
-
-                         e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
-                                     ? p2m_ram_logdirty : p2m_ram_rw;
+                         else
+                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
+                                         ? p2m_ram_logdirty : p2m_ram_rw;
                           ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
                      }
                      e.recalc = 0;

Yu
>
>   -George
>
Yu Zhang April 7, 2017, 10:55 a.m. UTC | #13
On 4/7/2017 6:26 PM, Jan Beulich wrote:
>>>> On 07.04.17 at 11:53, <yu.c.zhang@linux.intel.com> wrote:
>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>>>>                        e.ipat = ipat;
>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>                        {
>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>> +                         {
>>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>>> +                             p2m->ioreq.entry_count--;
>>>> +                         }
>>>> +
>>>>                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
>>>>                                         ? p2m_ram_logdirty : p2m_ram_rw;
>>> I don't think this can be right: Why would it be valid to change the
>>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>>> here, without taking into account further information? This code
>>> can run at any time, not just when you want to reset things. So at
>>> the very least there is a check missing whether a suitable ioreq
>>> server still exists (and only if it doesn't you want to do the type
>>> reset).
>> Sorry, Jan. I think we have discussed this quite long ago.
>> Indeed, there's information lacked here, and that's why global_logdirty
>> is disallowed
>> when there's remaining p2m_ioreq_server entries. :-)
> log-dirty isn't the interesting part (which is why I did put it in
> parentheses). You change ioreq-server to ram-rw here
> regardless of whether we're actually cleaning up after a
> detached server.
>
>>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>>        if ( is_epte_valid(ept_entry) )
>>>>        {
>>>>            if ( (recalc || ept_entry->recalc) &&
>>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>>> I think the distinction between these two is rather arbitrary, and I
>>> also think this is part of the problem above: Distinguishing log-dirty
>>> from ram-rw requires auxiliary data to be consulted. The same
>>> ought to apply to ioreq-server, and then there wouldn't be a need
>>> to have two p2m_*_changeable() flavors.
>> Well, I think we have also discussed this quite long ago, here is the link.
>> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html
> That was a different discussion, namely not considering this ...
>
>>> Of course the subsequent use p2m_is_logdirty_range() may then
>>> need amending.
>>>
>>> In the end it looks like you have the inverse problem here compared
>>> to above: You should return ram-rw when the reset was already
>>> initiated. At least that's how I would see the logic to match up with
>>> the log-dirty handling (where the _effective_ rather than the last
>>> stored type is being returned).
> ... at all.

Sorry, I still do not get it.
George, do you have any thoughts on this?

>>>> @@ -629,6 +640,21 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
>>>>            if ( entry_content.l1 != 0 )
>>>>                p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
>>>>    
>>>> +        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
>>>> +
>>>> +        /*
>>>> +         * p2m_ioreq_server is only used for 4K pages, so
>>>> +         * the count shall only be performed for level 1 entries.
>>>> +         */
>>>> +        if ( p2mt == p2m_ioreq_server )
>>>> +            p2m->ioreq.entry_count++;
>>>> +
>>>> +        if ( p2mt_old == p2m_ioreq_server )
>>>> +        {
>>>> +            ASSERT(p2m->ioreq.entry_count > 0);
>>>> +            p2m->ioreq.entry_count--;
>>>> +        }
>>>> +
>>>>            /* level 1 entry */
>>>>            p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
>>> I think to match up with EPT you also want to add
>>>
>>>       ASSERT(p2mt_old != p2m_ioreq_server);
>>>
>>> to the 2M and 1G paths.
>> Is this really necessary? 2M and 1G page does not have p2mt_old,
>> defining one and peek the p2m type just
>> to have an ASSERT does not seem quite useful - and will hurt the
>> performance.
> I don't follow. Matching up with the 4k case, what you'd add would
> be
>
>      ASSERT(p2m_flags_to_type(flags) != p2m_ioreq_server);

Yeah, that's a good point.

Yu

> Jan
>
Jan Beulich April 7, 2017, 11:28 a.m. UTC | #14
>>> On 07.04.17 at 12:50, <yu.c.zhang@linux.intel.com> wrote:
> On 4/7/2017 6:28 PM, George Dunlap wrote:
>> On 07/04/17 11:14, Yu Zhang wrote:
>>>
>>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>>> @@ -544,6 +544,12 @@ static int resolve_misconfig(struct p2m_domain
>>>>> *p2m, unsigned long gfn)
>>>>>                        e.ipat = ipat;
>>>>>                        if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
>>>>>                        {
>>>>> +                         if ( e.sa_p2mt == p2m_ioreq_server )
>>>>> +                         {
>>>>> +                             ASSERT(p2m->ioreq.entry_count > 0);
>>>>> +                             p2m->ioreq.entry_count--;
>>>>> +                         }
>>>>> +
>>>>>                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn
>>>>> + i, gfn + i)
>>>>>                                         ? p2m_ram_logdirty : p2m_ram_rw;
>>>> I don't think this can be right: Why would it be valid to change the
>>>> type from p2m_ioreq_server to p2m_ram_rw (or p2m_ram_logdirty)
>>>> here, without taking into account further information? This code
>>>> can run at any time, not just when you want to reset things. So at
>>>> the very least there is a check missing whether a suitable ioreq
>>>> server still exists (and only if it doesn't you want to do the type
>>>> reset).
>>> Also I do not think we need to check if a suitable ioreq server still
>>> exists. We have guaranteed
>>> in our patch that no new ioreq server will be mapped as long as the p2m
>>> table is not clean. :)
>> Jan is saying that you should only change ioreq_server -> ram if there
>> is *not* an ioreq server; and that if this is called with an ioreq
>> server still active, then it must be some other change you're looking at.
>>
>> The problem, though, is that misconfiguration happens in many
>> circumstances.  Grep for "memory_type_changed()" -- each of those
>> results in a recalculation of the entire p2m, which will (in the current
>> code) wipe out any ioreq_server entries.
> 
> Well, I'm not aware that other actions besides the logdirty will cause the reset.
> But if that would happen, will below change solve this?
> 
> @@ -546,12 +546,16 @@ static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
>                       {
>                            if ( e.sa_p2mt == p2m_ioreq_server )
>                            {
> -                             ASSERT(p2m->ioreq.entry_count > 0);
> -                             p2m->ioreq.entry_count--;
> +                             if ( p2m->ioreq.server == NULL )
> +                             {
> +                                 ASSERT(p2m->ioreq.entry_count > 0);
> +                                 p2m->ioreq.entry_count--;
> +                                 e.sa_p2mt = p2m_ram_rw;
> +                             }
>                            }
> -
> -                         e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
> -                                     ? p2m_ram_logdirty : p2m_ram_rw;
> +                         else
> +                             e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
> +                                         ? p2m_ram_logdirty : p2m_ram_rw;

Now you _never_ change away from ioreq-server, you only adjust
the counter.

Jan
Jan Beulich April 7, 2017, 11:31 a.m. UTC | #15
>>> On 07.04.17 at 12:55, <yu.c.zhang@linux.intel.com> wrote:
> On 4/7/2017 6:26 PM, Jan Beulich wrote:
>>>>> On 07.04.17 at 11:53, <yu.c.zhang@linux.intel.com> wrote:
>>> On 4/7/2017 5:40 PM, Jan Beulich wrote:
>>>>>>> On 06.04.17 at 17:53, <yu.c.zhang@linux.intel.com> wrote:
>>>>> @@ -965,7 +987,7 @@ static mfn_t ept_get_entry(struct p2m_domain *p2m,
>>>>>        if ( is_epte_valid(ept_entry) )
>>>>>        {
>>>>>            if ( (recalc || ept_entry->recalc) &&
>>>>> -             p2m_is_changeable(ept_entry->sa_p2mt) )
>>>>> +             p2m_check_changeable(ept_entry->sa_p2mt) )
>>>> I think the distinction between these two is rather arbitrary, and I
>>>> also think this is part of the problem above: Distinguishing log-dirty
>>>> from ram-rw requires auxiliary data to be consulted. The same
>>>> ought to apply to ioreq-server, and then there wouldn't be a need
>>>> to have two p2m_*_changeable() flavors.
>>> Well, I think we have also discussed this quite long ago, here is the link.
>>> https://lists.xen.org/archives/html/xen-devel/2016-09/msg01017.html 
>> That was a different discussion, namely not considering this ...
>>
>>>> Of course the subsequent use p2m_is_logdirty_range() may then
>>>> need amending.
>>>>
>>>> In the end it looks like you have the inverse problem here compared
>>>> to above: You should return ram-rw when the reset was already
>>>> initiated. At least that's how I would see the logic to match up with
>>>> the log-dirty handling (where the _effective_ rather than the last
>>>> stored type is being returned).
>> ... at all.
> 
> Sorry I still do not get it.

Leaving ioreq-server out of the picture, what the function returns
to the caller is the type as it would be if a re-calculation would have
been done on the entry, even if that hasn't happened yet (the
function here shouldn't change any entries after all). I think we
want the same behavior here for what have been ioreq-server
entries (and which would become ram-rw after the next re-calc).

Jan
Patch

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 5bf3b6d..07a6c26 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -955,6 +955,14 @@  int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
 
     spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
 
+    if ( rc == 0 && flags == 0 )
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+    }
+
     return rc;
 }
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index a57b385..4b591fe 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -187,6 +187,15 @@  out:
  */
 static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
 {
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    /*
+     * Refuse to turn on global log-dirty mode if
+     * there are outstanding p2m_ioreq_server pages.
+     */
+    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
+        return -EBUSY;
+
     /* turn on PG_log_dirty bit in paging mode */
     paging_lock(d);
     d->arch.paging.mode |= PG_log_dirty;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index cc1eb21..ecd5ceb 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -544,6 +544,12 @@  static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
                     e.ipat = ipat;
                     if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
                     {
+                         if ( e.sa_p2mt == p2m_ioreq_server )
+                         {
+                             ASSERT(p2m->ioreq.entry_count > 0);
+                             p2m->ioreq.entry_count--;
+                         }
+
                          e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
                                      ? p2m_ram_logdirty : p2m_ram_rw;
                          ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
@@ -816,6 +822,22 @@  ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         new_entry.suppress_ve = is_epte_valid(&old_entry) ?
                                     old_entry.suppress_ve : 1;
 
+    /*
+     * p2m_ioreq_server is only used for 4K pages, so the
+     * count is only maintained for level 0 (4K) EPT entries.
+     */
+    if ( p2mt == p2m_ioreq_server )
+    {
+        ASSERT(i == 0);
+        p2m->ioreq.entry_count++;
+    }
+
+    if ( ept_entry->sa_p2mt == p2m_ioreq_server )
+    {
+        ASSERT(p2m->ioreq.entry_count > 0 && i == 0);
+        p2m->ioreq.entry_count--;
+    }
+
     rc = atomic_write_ept_entry(ept_entry, new_entry, target);
     if ( unlikely(rc) )
         old_entry.epte = 0;
@@ -965,7 +987,7 @@  static mfn_t ept_get_entry(struct p2m_domain *p2m,
     if ( is_epte_valid(ept_entry) )
     {
         if ( (recalc || ept_entry->recalc) &&
-             p2m_is_changeable(ept_entry->sa_p2mt) )
+             p2m_check_changeable(ept_entry->sa_p2mt) )
             *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                       : p2m_ram_rw;
         else
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index c0055f3..4b2ff9e 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -436,11 +436,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
          needs_recalc(l1, *pent) )
     {
         l1_pgentry_t e = *pent;
+        p2m_type_t p2mt_old;
 
         if ( !valid_recalc(l1, e) )
             P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
                       p2m->domain->domain_id, gfn, level);
-        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
+        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
+        if ( p2m_is_changeable(p2mt_old) )
         {
             unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
@@ -460,6 +462,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
                      mfn &= ~((unsigned long)_PAGE_PSE_PAT >> PAGE_SHIFT);
                 flags |= _PAGE_PSE;
             }
+
+            if ( p2mt_old == p2m_ioreq_server )
+            {
+                ASSERT(p2m->ioreq.entry_count > 0);
+                p2m->ioreq.entry_count--;
+            }
+
             e = l1e_from_pfn(mfn, flags);
             p2m_add_iommu_flags(&e, level,
                                 (p2mt == p2m_ram_rw)
@@ -606,6 +615,8 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 
     if ( page_order == PAGE_ORDER_4K )
     {
+        p2m_type_t p2mt_old;
+
         rc = p2m_next_level(p2m, &table, &gfn_remainder, gfn,
                             L2_PAGETABLE_SHIFT - PAGE_SHIFT,
                             L2_PAGETABLE_ENTRIES, PGT_l1_page_table, 1);
@@ -629,6 +640,21 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         if ( entry_content.l1 != 0 )
             p2m_add_iommu_flags(&entry_content, 0, iommu_pte_flags);
 
+        p2mt_old = p2m_flags_to_type(l1e_get_flags(*p2m_entry));
+
+        /*
+         * p2m_ioreq_server is only used for 4K pages, so
+         * the count is only maintained for level 1 (4K) entries.
+         */
+        if ( p2mt == p2m_ioreq_server )
+            p2m->ioreq.entry_count++;
+
+        if ( p2mt_old == p2m_ioreq_server )
+        {
+            ASSERT(p2m->ioreq.entry_count > 0);
+            p2m->ioreq.entry_count--;
+        }
+
         /* level 1 entry */
         p2m->write_p2m_entry(p2m, gfn, p2m_entry, entry_content, 1);
         /* NB: paging_write_p2m_entry() handles tlb flushes properly */
@@ -729,7 +755,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
                                      struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !recalc || !p2m_is_changeable(t) )
+    if ( !recalc || !p2m_check_changeable(t) )
         return t;
     return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                 : p2m_ram_rw;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index b84add0..4169d18 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -317,6 +317,15 @@  int p2m_set_ioreq_server(struct domain *d,
         if ( p2m->ioreq.server != NULL )
             goto out;
 
+        /*
+         * It is possible that an ioreq server has just been unmapped
+         * and released the spin lock, with some p2m_ioreq_server
+         * entries remaining in the p2m table. Refuse another ioreq
+         * server mapping request in that case.
+         */
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            goto out;
+
         p2m->ioreq.server = s;
         p2m->ioreq.flags = flags;
     }
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 4521620..e7e390d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -120,7 +120,10 @@  typedef unsigned int p2m_query_t;
 
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
-                              | p2m_to_mask(p2m_ram_logdirty) )
+                              | p2m_to_mask(p2m_ram_logdirty) \
+                              | p2m_to_mask(p2m_ioreq_server) )
+
+#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
 
 #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
 
@@ -157,6 +160,7 @@  typedef unsigned int p2m_query_t;
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
 #define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
+#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
 /* Grant types are *not* considered valid, because they can be
@@ -178,6 +182,8 @@  typedef unsigned int p2m_query_t;
 
 #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) & P2M_INVALID_MFN_TYPES)
 
+#define p2m_check_changeable(t) (p2m_is_changeable(t) && !p2m_is_ioreq(t))
+
 typedef enum {
     p2m_host,
     p2m_nested,
@@ -349,6 +355,7 @@  struct p2m_domain {
           * are to be emulated by an ioreq server.
           */
          unsigned int flags;
+         unsigned long entry_count;
      } ioreq;
 };