[PATCHv6,2/2] x86/ept: defer the invalidation until the p2m lock is released

Message ID 1450446634-8762-3-git-send-email-david.vrabel@citrix.com (mailing list archive)
State New, archived

Commit Message

David Vrabel Dec. 18, 2015, 1:50 p.m. UTC
Holding the p2m lock while calling ept_sync_domain() is very expensive
since it does an on_selected_cpus() call.  IPIs on many-socket machines
can be very slow and on_selected_cpus() is serialized.

It is safe to defer the invalidate until the p2m lock is released
except for two cases:

1. When freeing a page table page (since partial translations may be
   cached).
2. When reclaiming a zero page as part of PoD.

For these cases, add p2m_tlb_flush_sync() calls which will immediately
perform the invalidate before the page is freed or reclaimed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
v6:
- Move p2m_tlb_flush_sync() to immediately before p2m_free_ptp().  It was
  called all the time otherwise.

v5:
- add p2m_tlb_flush_sync() and call it before freeing page table pages
  and reclaiming zeroed pod pages.

v2:
- use per-p2m list for deferred pages.
- update synced_mask while holding write lock.
---
 xen/arch/x86/mm/mm-locks.h | 23 +++++++++++++++--------
 xen/arch/x86/mm/p2m-ept.c  | 42 ++++++++++++++++++++++++++++++++++--------
 xen/arch/x86/mm/p2m-pod.c  |  2 ++
 xen/arch/x86/mm/p2m.c      | 14 ++++++++++++++
 xen/include/asm-x86/p2m.h  | 10 ++++++++++
 5 files changed, 75 insertions(+), 16 deletions(-)
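
For orientation, the shape of the deferral mechanism, condensed from the
full patch at the end of this page (the altp2m locking variant and the
EPT/PoD call sites are omitted here):

    /* p2m_lock()/p2m_unlock() bracket every p2m update (mm-locks.h). */
    #define p2m_lock(p)                             \
        do {                                        \
            mm_write_lock(p2m, &(p)->lock);         \
            (p)->defer_flush++;                     \
        } while (0)
    #define p2m_unlock(p)                           \
        do {                                        \
            if ( --(p)->defer_flush == 0 )          \
                p2m_tlb_flush_and_unlock(p);        \
            else                                    \
                mm_write_unlock(&(p)->lock);        \
        } while (0)

    /* ept_sync_domain(), called with the lock held, now only records
     * that an invalidation is pending: */
    if ( p2m->defer_flush )
    {
        p2m->need_flush = 1;
        return;
    }
    ept_sync_domain_mask(p2m, d->domain_dirty_cpumask);

    /* p2m_tlb_flush_and_unlock() (p2m.c) performs the deferred flush,
     * dropping the lock first so the IPIs are issued outside it: */
    if ( p2m->need_flush )
        p2m->flush_and_unlock(p2m, 1);
    else
        mm_write_unlock(&p2m->lock);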

Comments

Tian, Kevin Dec. 20, 2015, 6:56 a.m. UTC | #1
> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: Friday, December 18, 2015 9:51 PM
> 
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does a on_selected_cpus() call.  IPIs on many socket machines
> can be very slows and on_selected_cpus() is serialized.
> 
> It is safe to defer the invalidate until the p2m lock is released
> except for two cases:
> 
> 1. When freeing a page table page (since partial translations may be
>    cached).
> 2. When reclaiming a zero page as part of PoD.
> 
> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> perform the invalidate before the page is freed or reclaimed.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

At a quick glance the patch is OK.  Will find time to think about it more
carefully since this part is tricky (also to allow others to comment before
my ack on the EPT part).

Thanks
Kevin
George Dunlap Dec. 22, 2015, 12:23 p.m. UTC | #2
On 18/12/15 13:50, David Vrabel wrote:
> Holding the p2m lock while calling ept_sync_domain() is very expensive
> since it does a on_selected_cpus() call.  IPIs on many socket machines
> can be very slows and on_selected_cpus() is serialized.
> 
> It is safe to defer the invalidate until the p2m lock is released
> except for two cases:
> 
> 1. When freeing a page table page (since partial translations may be
>    cached).
> 2. When reclaiming a zero page as part of PoD.
> 
> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> perform the invalidate before the page is freed or reclaimed.

There are at least two other places in the PoD code where the "remove ->
check -> add to cache -> unlock" pattern exists; and it looks to me like
there are other places where races might occur (e.g.,
p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
p2m_altp2m_propagate_change(), which does remove -> put -> unlock).

Part of me wonders whether, rather than making callers that need it
remember to do a flush, it might be better to explicitly pass in
P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
people think about the fact that the p2m change may not actually take
effect until later.  Any thoughts on that?
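
Something along these lines (hypothetical flag names and signature,
purely to illustrate the suggestion -- not existing code):

    /* Caller states explicitly whether the change must be visible
     * before it proceeds, or may be batched until the unlock. */
    rc = p2m_set_entry(p2m, gfn, mfn, order, p2m_invalid,
                       p2m->default_access, P2M_FLUSH);
    ...
    rc = p2m_set_entry(p2m, gfn, mfn, order, p2m_ram_rw,
                       p2m->default_access, P2M_CAN_DEFER);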

Comments on the current approach inline.

> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
> index c094320..43c7f1b 100644
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>          unmap_domain_page(epte);
>      }
>      
> +    p2m_tlb_flush_sync(p2m);
>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));

It's probably worth a comment here pointing out that even if this
function is called several times (e.g., if you replace a load of 4k
entries with a 1G entry), the actual flush will only happen the first time.
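
Something like this, perhaps (suggested wording only):

    /*
     * While need_flush is set, only the first call here performs the
     * invalidation; ept_flush_and_unlock() clears need_flush, so
     * subsequent calls (e.g. when a tree of 4k entries is replaced by
     * a 1G entry) return without flushing again.
     */
    p2m_tlb_flush_sync(p2m);
    p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));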

> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
> +{
> +    p2m->need_flush = 0;
> +    if ( unlock )
> +        mm_write_unlock(&p2m->lock);
> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>  }

Having a function called "flush_and_unlock", with a boolean as to
whether to unlock or not, just seems a bit wonky.

Wouldn't it make more sense to have the hook just named "flush_sync()",
and move the unlocking out into the generic p2m code (where you already
have the check for need_flush)?
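
I.e. roughly something like this (untested sketch; the flush still
happens after the unlock, outside the lock):

    /* EPT hook only does the synchronous invalidation... */
    static void ept_flush_sync(struct p2m_domain *p2m)
    {
        ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
    }

    /* ...and the generic code decides about the lock. */
    void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m)
    {
        bool_t flush = p2m->need_flush;

        p2m->need_flush = 0;
        mm_write_unlock(&p2m->lock);
        if ( flush )
            p2m->flush_sync(p2m);
    }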

> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> index fa46dd9..9c394c2 100644
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -261,6 +261,10 @@ struct p2m_domain {
>                                            unsigned long gfn, l1_pgentry_t *p,
>                                            l1_pgentry_t new, unsigned int level);
>      long               (*audit_p2m)(struct p2m_domain *p2m);
> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
> +
> +    unsigned int defer_flush;
> +    bool_t need_flush;

It's probably worth a comment that at the moment calling
flush_and_unlock() is gated on need_flush; so it's OK not to implement
flush_and_unlock() as long as you never set need_flush.
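
E.g. (suggested wording):

    /*
     * flush_and_unlock() is only called when need_flush is set, so a
     * p2m implementation which never sets need_flush may leave this
     * hook NULL.
     */
    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);

    unsigned int defer_flush;
    bool_t need_flush;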

Thanks,
 -George
Andrew Cooper Dec. 22, 2015, 2:01 p.m. UTC | #3
On 22/12/15 12:23, George Dunlap wrote:
> On 18/12/15 13:50, David Vrabel wrote:
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
> There are at least two other places in the PoD code where the "remove ->
> check -> add to cache -> unlock" pattern exist; and it looks to me like
> there are other places where races might occur (e.g.,
> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>
> Part of me wonders whether, rather than making callers that need it
> remember to do a flush, it might be better to explicitly pass in
> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
> people think about the fact that the p2m change may not actually take
> effect until later.  Any thoughts on that?
>
> Comments on the current approach inline.
>
>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>> index c094320..43c7f1b 100644
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>          unmap_domain_page(epte);
>>      }
>>      
>> +    p2m_tlb_flush_sync(p2m);
>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
> It's probably worth a comment here pointing out that even if this
> function is called several times (e.g., if you replace a load of 4k
> entries with a 1G entry), the actual flush will only happen the first time.
>
>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>> +{
>> +    p2m->need_flush = 0;
>> +    if ( unlock )
>> +        mm_write_unlock(&p2m->lock);
>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>  }
> Having a function called "flush_and_unlock", with a boolean as to
> whether to unlock or not, just seems a bit wonky.
>
> Wouldn't it make more sense to have the hook just named "flush_sync()",
> and move the unlocking out in the generic p2m code (where you already
> have the check for need_flush)?
>
>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>> index fa46dd9..9c394c2 100644
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>                                            unsigned long gfn, l1_pgentry_t *p,
>>                                            l1_pgentry_t new, unsigned int level);
>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>> +
>> +    unsigned int defer_flush;
>> +    bool_t need_flush;
> It's probably worth a comment that at the moment calling
> flush_and_unlock() is gated on need_flush; so it's OK not to implement
> flush_and_unlock() as long as you never set need_flush.

This is just one small accident (in code elsewhere) away from a code
injection vulnerability.

Either we should require that all function pointers are filled in, or
BUG() if the pointer is missing when we attempt to use it.
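
I.e. either install a hook for every p2m implementation, or something
like this at the call site (sketch):

    void p2m_tlb_flush_sync(struct p2m_domain *p2m)
    {
        if ( p2m->need_flush )
        {
            BUG_ON(!p2m->flush_and_unlock);
            p2m->flush_and_unlock(p2m, 0);
        }
    }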

~Andrew
David Vrabel Dec. 22, 2015, 2:20 p.m. UTC | #4
On 22/12/15 14:01, Andrew Cooper wrote:
> On 22/12/15 12:23, George Dunlap wrote:
>> On 18/12/15 13:50, David Vrabel wrote:
>>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>>> can be very slows and on_selected_cpus() is serialized.
>>>
>>> It is safe to defer the invalidate until the p2m lock is released
>>> except for two cases:
>>>
>>> 1. When freeing a page table page (since partial translations may be
>>>    cached).
>>> 2. When reclaiming a zero page as part of PoD.
>>>
>>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>>> perform the invalidate before the page is freed or reclaimed.
>> There are at least two other places in the PoD code where the "remove ->
>> check -> add to cache -> unlock" pattern exist; and it looks to me like
>> there are other places where races might occur (e.g.,
>> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
>> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>>
>> Part of me wonders whether, rather than making callers that need it
>> remember to do a flush, it might be better to explicitly pass in
>> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
>> people think about the fact that the p2m change may not actually take
>> effect until later.  Any thoughts on that?
>>
>> Comments on the current approach inline.
>>
>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>> index c094320..43c7f1b 100644
>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>>          unmap_domain_page(epte);
>>>      }
>>>      
>>> +    p2m_tlb_flush_sync(p2m);
>>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
>> It's probably worth a comment here pointing out that even if this
>> function is called several times (e.g., if you replace a load of 4k
>> entries with a 1G entry), the actual flush will only happen the first time.
>>
>>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>>> +{
>>> +    p2m->need_flush = 0;
>>> +    if ( unlock )
>>> +        mm_write_unlock(&p2m->lock);
>>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>>  }
>> Having a function called "flush_and_unlock", with a boolean as to
>> whether to unlock or not, just seems a bit wonky.
>>
>> Wouldn't it make more sense to have the hook just named "flush_sync()",
>> and move the unlocking out in the generic p2m code (where you already
>> have the check for need_flush)?
>>
>>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>>> index fa46dd9..9c394c2 100644
>>> --- a/xen/include/asm-x86/p2m.h
>>> +++ b/xen/include/asm-x86/p2m.h
>>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>>                                            unsigned long gfn, l1_pgentry_t *p,
>>>                                            l1_pgentry_t new, unsigned int level);
>>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>>> +
>>> +    unsigned int defer_flush;
>>> +    bool_t need_flush;
>> It's probably worth a comment that at the moment calling
>> flush_and_unlock() is gated on need_flush; so it's OK not to implement
>> flush_and_unlock() as long as you never set need_flush.
> 
> This is just one small accident (in code elsewhere) away from a code
> injection vulnerability.
> 
> Either we should require that all function pointers are filled in, or
> BUG() if the pointer is missing when we attempt to use it.

Jan asked for the call to be conditional on need_flush and to not test
flush_and_unlock.

David
George Dunlap Dec. 22, 2015, 2:56 p.m. UTC | #5
On 22/12/15 14:20, David Vrabel wrote:
> On 22/12/15 14:01, Andrew Cooper wrote:
>> On 22/12/15 12:23, George Dunlap wrote:
>>> On 18/12/15 13:50, David Vrabel wrote:
>>>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>>>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>>>> can be very slows and on_selected_cpus() is serialized.
>>>>
>>>> It is safe to defer the invalidate until the p2m lock is released
>>>> except for two cases:
>>>>
>>>> 1. When freeing a page table page (since partial translations may be
>>>>    cached).
>>>> 2. When reclaiming a zero page as part of PoD.
>>>>
>>>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>>>> perform the invalidate before the page is freed or reclaimed.
>>> There are at least two other places in the PoD code where the "remove ->
>>> check -> add to cache -> unlock" pattern exist; and it looks to me like
>>> there are other places where races might occur (e.g.,
>>> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
>>> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).
>>>
>>> Part of me wonders whether, rather than making callers that need it
>>> remember to do a flush, it might be better to explicitly pass in
>>> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
>>> people think about the fact that the p2m change may not actually take
>>> effect until later.  Any thoughts on that?
>>>
>>> Comments on the current approach inline.
>>>
>>>> diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
>>>> index c094320..43c7f1b 100644
>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -263,6 +263,7 @@ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
>>>>          unmap_domain_page(epte);
>>>>      }
>>>>      
>>>> +    p2m_tlb_flush_sync(p2m);
>>>>      p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
>>> It's probably worth a comment here pointing out that even if this
>>> function is called several times (e.g., if you replace a load of 4k
>>> entries with a 1G entry), the actual flush will only happen the first time.
>>>
>>>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>>>> +{
>>>> +    p2m->need_flush = 0;
>>>> +    if ( unlock )
>>>> +        mm_write_unlock(&p2m->lock);
>>>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>>>  }
>>> Having a function called "flush_and_unlock", with a boolean as to
>>> whether to unlock or not, just seems a bit wonky.
>>>
>>> Wouldn't it make more sense to have the hook just named "flush_sync()",
>>> and move the unlocking out in the generic p2m code (where you already
>>> have the check for need_flush)?
>>>
>>>> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
>>>> index fa46dd9..9c394c2 100644
>>>> --- a/xen/include/asm-x86/p2m.h
>>>> +++ b/xen/include/asm-x86/p2m.h
>>>> @@ -261,6 +261,10 @@ struct p2m_domain {
>>>>                                            unsigned long gfn, l1_pgentry_t *p,
>>>>                                            l1_pgentry_t new, unsigned int level);
>>>>      long               (*audit_p2m)(struct p2m_domain *p2m);
>>>> +    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
>>>> +
>>>> +    unsigned int defer_flush;
>>>> +    bool_t need_flush;
>>> It's probably worth a comment that at the moment calling
>>> flush_and_unlock() is gated on need_flush; so it's OK not to implement
>>> flush_and_unlock() as long as you never set need_flush.
>>
>> This is just one small accident (in code elsewhere) away from a code
>> injection vulnerability.
>>
>> Either we should require that all function pointers are filled in, or
>> BUG() if the pointer is missing when we attempt to use it.
> 
> Jan asked for the call to be conditional on need_flush and to not test
> flush_and_unlock.

Then perhaps the other paging modes should point this to a function
which calls BUG()?  Or perhaps a noop -- no point in crashing a machine
in production because you don't need to actually do a flush.
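
E.g. a stub along these lines (hypothetical; substitute BUG() for the
comment if we want the strict variant):

    /* For p2m implementations that never set need_flush. */
    static void p2m_flush_and_unlock_nop(struct p2m_domain *p2m, bool_t unlock)
    {
        /* Nothing to flush for this implementation. */
        if ( unlock )
            mm_write_unlock(&p2m->lock);
    }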

 -George
David Vrabel Feb. 1, 2016, 2:50 p.m. UTC | #6
On 20/12/15 06:56, Tian, Kevin wrote:
>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Sent: Friday, December 18, 2015 9:51 PM
>>
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> 
> In a quick glimpse the patch is OK. Will find a time to think about it more
> carefully since this part is tricky (also allow others to comment before
> my ack in EPT part).

Ping?  Did you get more time to think about this patch?

David
David Vrabel Feb. 1, 2016, 3:57 p.m. UTC | #7
On 22/12/15 12:23, George Dunlap wrote:
> On 18/12/15 13:50, David Vrabel wrote:
>> Holding the p2m lock while calling ept_sync_domain() is very expensive
>> since it does a on_selected_cpus() call.  IPIs on many socket machines
>> can be very slows and on_selected_cpus() is serialized.
>>
>> It is safe to defer the invalidate until the p2m lock is released
>> except for two cases:
>>
>> 1. When freeing a page table page (since partial translations may be
>>    cached).
>> 2. When reclaiming a zero page as part of PoD.
>>
>> For these cases, add p2m_tlb_flush_sync() calls which will immediately
>> perform the invalidate before the page is freed or reclaimed.
> 
> There are at least two other places in the PoD code where the "remove ->
> check -> add to cache -> unlock" pattern exist; and it looks to me like
> there are other places where races might occur (e.g.,
> p2m_paging_evict(), which does remove -> scrub -> put -> unlock;
> p2m_altp2m_propagate_change(), which does remove -> put -> unlock).

I've fixed the PoD cases.

Freeing pages back to the domheap is protected by the TLB flush
timestamp mechanism.  See patch #1 in this series, which makes use of
this for invalidating the guest-physical mappings (in addition to any
linear and combined mappings that were already flushed).

> Part of me wonders whether, rather than making callers that need it
> remember to do a flush, it might be better to explicitly pass in
> P2M_FLUSH or P2M_CAN_DEFER when calling p2m_set_entry, just to make
> people think about the fact that the p2m change may not actually take
> effect until later.  Any thoughts on that?

It is more efficient to batch flushes as much as possible, so explicit
flushes just before they are needed are better.  I also think it's
clearer to have an explicit flush before the operation that actually
needs it.

>> +static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
>> +{
>> +    p2m->need_flush = 0;
>> +    if ( unlock )
>> +        mm_write_unlock(&p2m->lock);
>> +    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
>>  }
> 
> Having a function called "flush_and_unlock", with a boolean as to
> whether to unlock or not, just seems a bit wonky.
> 
> Wouldn't it make more sense to have the hook just named "flush_sync()",
> and move the unlocking out in the generic p2m code (where you already
> have the check for need_flush)?

The ept_sync_domain_mask() call must be outside the lock when unlock is
true or you don't get the performance benefits from this patch.

David
Tian, Kevin Feb. 2, 2016, 7:58 a.m. UTC | #8
> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Sent: Monday, February 01, 2016 10:50 PM
> 
> On 20/12/15 06:56, Tian, Kevin wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Sent: Friday, December 18, 2015 9:51 PM
> >>
> >> Holding the p2m lock while calling ept_sync_domain() is very expensive
> >> since it does a on_selected_cpus() call.  IPIs on many socket machines
> >> can be very slows and on_selected_cpus() is serialized.
> >>
> >> It is safe to defer the invalidate until the p2m lock is released
> >> except for two cases:
> >>
> >> 1. When freeing a page table page (since partial translations may be
> >>    cached).
> >> 2. When reclaiming a zero page as part of PoD.
> >>
> >> For these cases, add p2m_tlb_flush_sync() calls which will immediately
> >> perform the invalidate before the page is freed or reclaimed.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >
> > In a quick glimpse the patch is OK. Will find a time to think about it more
> > carefully since this part is tricky (also allow others to comment before
> > my ack in EPT part).
> 
> Ping?  Did you get more time to think about this patch?
> 

Sorry for the delay on this.  I saw you've sent out a v7.  Queued to
be reviewed tomorrow.

Thanks
Kevin

Patch

diff --git a/xen/arch/x86/mm/mm-locks.h b/xen/arch/x86/mm/mm-locks.h
index 76c7217..b1a92e2 100644
--- a/xen/arch/x86/mm/mm-locks.h
+++ b/xen/arch/x86/mm/mm-locks.h
@@ -263,14 +263,21 @@  declare_mm_lock(altp2mlist)
  */
 
 declare_mm_rwlock(altp2m);
-#define p2m_lock(p)                         \
-{                                           \
-    if ( p2m_is_altp2m(p) )                 \
-        mm_write_lock(altp2m, &(p)->lock);  \
-    else                                    \
-        mm_write_lock(p2m, &(p)->lock);     \
-}
-#define p2m_unlock(p)         mm_write_unlock(&(p)->lock);
+#define p2m_lock(p)                             \
+    do {                                        \
+        if ( p2m_is_altp2m(p) )                 \
+            mm_write_lock(altp2m, &(p)->lock);  \
+        else                                    \
+            mm_write_lock(p2m, &(p)->lock);     \
+        (p)->defer_flush++;                     \
+    } while (0)
+#define p2m_unlock(p)                           \
+    do {                                        \
+        if ( --(p)->defer_flush == 0 )          \
+            p2m_tlb_flush_and_unlock(p);        \
+        else                                    \
+            mm_write_unlock(&(p)->lock);        \
+    } while (0)
 #define gfn_lock(p,g,o)       p2m_lock(p)
 #define gfn_unlock(p,g,o)     p2m_unlock(p)
 #define p2m_read_lock(p)      mm_read_lock(p2m, &(p)->lock)
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index c094320..43c7f1b 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -263,6 +263,7 @@  static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int l
         unmap_domain_page(epte);
     }
     
+    p2m_tlb_flush_sync(p2m);
     p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn));
 }
 
@@ -1095,15 +1096,10 @@  static void __ept_sync_domain(void *info)
      */
 }
 
-void ept_sync_domain(struct p2m_domain *p2m)
+static void ept_sync_domain_prepare(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
-    /* Only if using EPT and this domain has some VCPUs to dirty. */
-    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
-        return;
-
-    ASSERT(local_irq_is_enabled());
 
     if ( nestedhvm_enabled(d) && !p2m_is_nestedp2m(p2m) )
         p2m_flush_nestedp2m(d);
@@ -1116,9 +1112,38 @@  void ept_sync_domain(struct p2m_domain *p2m)
      *    of an EP4TA reuse is still needed.
      */
     cpumask_setall(ept->invalidate);
+}
+
+static void ept_sync_domain_mask(struct p2m_domain *p2m, const cpumask_t *mask)
+{
+    on_selected_cpus(mask, __ept_sync_domain, p2m, 1);
+}
+
+void ept_sync_domain(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
 
-    on_selected_cpus(d->domain_dirty_cpumask,
-                     __ept_sync_domain, p2m, 1);
+    /* Only if using EPT and this domain has some VCPUs to dirty. */
+    if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
+        return;
+
+    ept_sync_domain_prepare(p2m);
+
+    if ( p2m->defer_flush )
+    {
+        p2m->need_flush = 1;
+        return;
+    }
+
+    ept_sync_domain_mask(p2m, d->domain_dirty_cpumask);
+}
+
+static void ept_flush_and_unlock(struct p2m_domain *p2m, bool_t unlock)
+{
+    p2m->need_flush = 0;
+    if ( unlock )
+        mm_write_unlock(&p2m->lock);
+    ept_sync_domain_mask(p2m, p2m->domain->domain_dirty_cpumask);
 }
 
 static void ept_enable_pml(struct p2m_domain *p2m)
@@ -1169,6 +1194,7 @@  int ept_p2m_init(struct p2m_domain *p2m)
     p2m->change_entry_type_range = ept_change_entry_type_range;
     p2m->memory_type_changed = ept_memory_type_changed;
     p2m->audit_p2m = NULL;
+    p2m->flush_and_unlock = ept_flush_and_unlock;
 
     /* Set the memory type used when accessing EPT paging structures. */
     ept->ept_mt = EPT_DEFAULT_MT;
diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
index ea16d3e..a5d672e 100644
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -886,6 +886,8 @@  p2m_pod_zero_check(struct p2m_domain *p2m, unsigned long *gfns, int count)
         }
     }
 
+    p2m_tlb_flush_sync(p2m);
+
     /* Now check each page for real */
     for ( i=0; i < count; i++ )
     {
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index ed0bbd7..efb15cd 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -324,6 +324,20 @@  void p2m_flush_hardware_cached_dirty(struct domain *d)
     }
 }
 
+void p2m_tlb_flush_sync(struct p2m_domain *p2m)
+{
+    if ( p2m->need_flush )
+        p2m->flush_and_unlock(p2m, 0);
+}
+
+void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m)
+{
+    if ( p2m->need_flush )
+        p2m->flush_and_unlock(p2m, 1);
+    else
+        mm_write_unlock(&p2m->lock);
+}
+
 mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
                     p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
                     unsigned int *page_order, bool_t locked)
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fa46dd9..9c394c2 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -261,6 +261,10 @@  struct p2m_domain {
                                           unsigned long gfn, l1_pgentry_t *p,
                                           l1_pgentry_t new, unsigned int level);
     long               (*audit_p2m)(struct p2m_domain *p2m);
+    void               (*flush_and_unlock)(struct p2m_domain *p2m, bool_t unlock);
+
+    unsigned int defer_flush;
+    bool_t need_flush;
 
     /* Default P2M access type for each page in the the domain: new pages,
      * swapped in pages, cleared pages, and pages that are ambiguously
@@ -353,6 +357,12 @@  static inline bool_t p2m_is_altp2m(const struct p2m_domain *p2m)
 
 #define p2m_get_pagetable(p2m)  ((p2m)->phys_table)
 
+/*
+ * Ensure any deferred p2m TLB flush has been completed on all VCPUs.
+ */
+void p2m_tlb_flush_sync(struct p2m_domain *p2m);
+void p2m_tlb_flush_and_unlock(struct p2m_domain *p2m);
+
 /**** p2m query accessors. They lock p2m_lock, and thus serialize
  * lookups wrt modifications. They _do not_ release the lock on exit.
  * After calling any of the variants below, caller needs to use