diff mbox

[v3,1/1] xen: move TLB-flush filtering out into populate_physmap during vm creation

Message ID 1473312603-28581-1-git-send-email-dongli.zhang@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dongli Zhang Sept. 8, 2016, 5:30 a.m. UTC
This patch implemented parts of TODO left in commit id
a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moved TLB-flush filtering out
into populate_physmap. Because of TLB-flush in alloc_heap_pages, it's very
slow to create a guest with memory size of more than 100GB on host with
100+ cpus.

This patch introduced a "MEMF_no_tlbflush" bit to memflags to indicate
whether TLB-flush should be done in alloc_heap_pages or its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
ignore TLB-flush. To use this bit after vm is created might lead to
security issue, that is, this would make pages accessible to the guest B,
when guest A may still have a cached mapping to them.

Therefore, this patch also introduced a "already_scheduled" field to struct
domain to indicate whether this domain has ever got scheduled by
hypervisor.  MEMF_no_tlbflush can be set only during vm creation phase when
already_scheduled is still 0 before this domain gets scheduled for the
first time.

TODO: ballooning very huge amount of memory cannot benefit from this patch
and might still be slow.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>

---
Changed since v2:
  * Limit this optimization to domain creation time.

---
 xen/common/domain.c     |  2 ++
 xen/common/memory.c     | 33 +++++++++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/common/schedule.c   |  5 +++++
 xen/include/xen/mm.h    |  2 ++
 xen/include/xen/sched.h |  3 +++
 6 files changed, 47 insertions(+), 1 deletion(-)

Comments

Dario Faggioli Sept. 8, 2016, 7:19 a.m. UTC | #1
On Thu, 2016-09-08 at 13:30 +0800, Dongli Zhang wrote:
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index f34dd56..3641469 100644
> @@ -150,6 +152,12 @@ static void populate_physmap(struct memop_args
> *a)
>                              max_order(curr_d)) )
>          return;
>  
> +    /* MEMF_no_tlbflush can be set only during vm creation phase
> when
> +     * already_scheduled is still 0 before this domain gets
> scheduled for
> +     * the first time. */
>
/*
 * Comment style for multi line comments in Xen
 * includes the 'wings'. :-)
 */

Yes, I know there's some inconsistency in this file (and in many others
:-/), but still.

> +    if ( d->already_scheduled == 0 )
>
unlikely() maybe?

> +        a->memflags |= MEMF_no_tlbflush;
> +
>      for ( i = a->nr_done; i < a->nr_extents; i++ )
>      {
>          if ( i != a->nr_done && hypercall_preempt_check() )
> @@ -214,6 +222,21 @@ static void populate_physmap(struct memop_args
> *a)
>                      goto out;
>                  }
>  
> +                if ( d->already_scheduled == 0 )
> +                {
> +                    for ( j = 0; j < (1U << a->extent_order); j++ )
> +                    {
> +                        if ( page[j].u.free.need_tlbflush &&
> +                             (page[j].tlbflush_timestamp <=
> tlbflush_current_time()) &&
> +                             (!need_tlbflush ||
> +                             (page[j].tlbflush_timestamp >
> tlbflush_timestamp)) )
>
This check is long, complicated to read (at least to a non TLBflush
guru), and also appear twice.. can it be put in an inline function with
a talking name?

Oh, and I think you don't need the parenthesis around these twos:

 (page[j].tlbflush_timestamp <= tlbflush_current_time())
 (page[j].tlbflush_timestamp > tlbflush_timestamp)

> +                        {
> +                            need_tlbflush = 1;
> +                            tlbflush_timestamp =
> page[j].tlbflush_timestamp;
> +                        }
> +                    }
> +                }
> +
>                  mfn = page_to_mfn(page);
>              }

> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 32a300f..593541a 100644
> @@ -1376,6 +1376,11 @@ static void schedule(void)
>  
>      next = next_slice.task;
>  
> +    /* Set already_scheduled to 1 when this domain gets scheduled
> for the
> +     * first time */
>
Wings again.

And, about the content, it's already clear from the code that this gets
set when a vcpu of a domain is scheduled. What we want here is a
_quick_ explanation of why we need the scheduler to record this
information.

> +    if ( next->domain->already_scheduled == 0 )
>
unlikely() (and here I'm sure :-)).

> +        next->domain->already_scheduled = 1;
> +
>
And, finally, I'd move this toward the bottom of the function, outside
of the pcpu_schedule_lock() critical section, e.g., around the call to
vcpu_periodic_timer_work(next);

>      sd->curr = next;
>  
>      if ( next_slice.time >= 0 ) /* -ve means no limit */

Regards,
Dario
Wei Liu Sept. 8, 2016, 10:50 a.m. UTC | #2
On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote:
> This patch implemented parts of TODO left in commit id
> a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moved TLB-flush filtering out
> into populate_physmap. Because of TLB-flush in alloc_heap_pages, it's very
> slow to create a guest with memory size of more than 100GB on host with
> 100+ cpus.
> 
> This patch introduced a "MEMF_no_tlbflush" bit to memflags to indicate
> whether TLB-flush should be done in alloc_heap_pages or its caller
> populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
> ignore TLB-flush. To use this bit after vm is created might lead to
> security issue, that is, this would make pages accessible to the guest B,
> when guest A may still have a cached mapping to them.
> 
> Therefore, this patch also introduced a "already_scheduled" field to struct
> domain to indicate whether this domain has ever got scheduled by
> hypervisor.  MEMF_no_tlbflush can be set only during vm creation phase when
> already_scheduled is still 0 before this domain gets scheduled for the
> first time.
> 
> TODO: ballooning very huge amount of memory cannot benefit from this patch
> and might still be slow.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> 
> ---
> Changed since v2:
>   * Limit this optimization to domain creation time.
> 
> ---
>  xen/common/domain.c     |  2 ++
>  xen/common/memory.c     | 33 +++++++++++++++++++++++++++++++++
>  xen/common/page_alloc.c |  3 ++-
>  xen/common/schedule.c   |  5 +++++
>  xen/include/xen/mm.h    |  2 ++
>  xen/include/xen/sched.h |  3 +++
>  6 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index a8804e4..611a471 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -303,6 +303,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
>      if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) )
>          goto fail;
>  
> +    d->already_scheduled = 0;
> +

Use false please -- this is a bool_t.

[...]
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 32a300f..593541a 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1376,6 +1376,11 @@ static void schedule(void)
>  
>      next = next_slice.task;
>  
> +    /* Set already_scheduled to 1 when this domain gets scheduled for the
> +     * first time */
> +    if ( next->domain->already_scheduled == 0 )
> +        next->domain->already_scheduled = 1;
> +

Can be simplified by omitting the "if" altogether.  And use "true" here.

Wei.
Dario Faggioli Sept. 8, 2016, 11:01 a.m. UTC | #3
On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote:
> On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote:
> > 
> > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> > index 32a300f..593541a 100644
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -1376,6 +1376,11 @@ static void schedule(void)
> >  
> >      next = next_slice.task;
> >  
> > +    /* Set already_scheduled to 1 when this domain gets scheduled
> > for the
> > +     * first time */
> > +    if ( next->domain->already_scheduled == 0 )
> > +        next->domain->already_scheduled = 1;
> > +
> Can be simplified by omitting the "if" altogether.  
>
Are you sure? I mean looking at the cases when the flag is already true
(which means, during the life of a domain, basically **always** except
a handful of instances after creation), what costs less, a check that
is always false, or a write that is always updating a value with its
current value?

And I'm not being ironic or anything, I honestly am not sure and this
is a genuine question.

> And use "true" here.
> 
Yeah, or just:

 if ( unlikely(!next->domain->already_scheduled) )
     ...

> Wei.
Wei Liu Sept. 8, 2016, 11:11 a.m. UTC | #4
On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote:
> On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote:
> > On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote:
> > > 
> > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> > > index 32a300f..593541a 100644
> > > --- a/xen/common/schedule.c
> > > +++ b/xen/common/schedule.c
> > > @@ -1376,6 +1376,11 @@ static void schedule(void)
> > >  
> > >      next = next_slice.task;
> > >  
> > > +    /* Set already_scheduled to 1 when this domain gets scheduled
> > > for the
> > > +     * first time */
> > > +    if ( next->domain->already_scheduled == 0 )
> > > +        next->domain->already_scheduled = 1;
> > > +
> > Can be simplified by omitting the "if" altogether.  
> >
> Are you sure? I mean looking at the cases when the flag is already true
> (which means, during the life of a domain, basically **always** except
> a handful of instances after creation), what costs less, a check that
> is always false, or a write that is always updating a value with its
> current value?

Omitting the check certain results in less instructions. And it would
probably eliminate misses in instruction cache and branch prediction
logic in the processor.

In the grand scheme of things, this is a rather minor optimisation, so I
wouldn't argue strongly for this.

Wei.

> 
> And I'm not being ironic or anything, I honestly am not sure and this
> is a genuine question.
> 
> > And use "true" here.
> > 
> Yeah, or just:
> 
>  if ( unlikely(!next->domain->already_scheduled) )
>      ...
> 
> > Wei.
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
Jürgen Groß Sept. 8, 2016, 11:19 a.m. UTC | #5
On 08/09/16 13:11, Wei Liu wrote:
> On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote:
>> On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote:
>>> On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote:
>>>>  
>>>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>>>> index 32a300f..593541a 100644
>>>> --- a/xen/common/schedule.c
>>>> +++ b/xen/common/schedule.c
>>>> @@ -1376,6 +1376,11 @@ static void schedule(void)
>>>>  
>>>>      next = next_slice.task;
>>>>  
>>>> +    /* Set already_scheduled to 1 when this domain gets scheduled
>>>> for the
>>>> +     * first time */
>>>> +    if ( next->domain->already_scheduled == 0 )
>>>> +        next->domain->already_scheduled = 1;
>>>> +
>>> Can be simplified by omitting the "if" altogether.  
>>>
>> Are you sure? I mean looking at the cases when the flag is already true
>> (which means, during the life of a domain, basically **always** except
>> a handful of instances after creation), what costs less, a check that
>> is always false, or a write that is always updating a value with its
>> current value?
> 
> Omitting the check certain results in less instructions. And it would
> probably eliminate misses in instruction cache and branch prediction
> logic in the processor.

The first scheduling is done via unpausing the domain. Why not setting
the flag to true in that path?

Juergen
Ian Jackson Sept. 8, 2016, 2:28 p.m. UTC | #6
Wei Liu writes ("Re: [PATCH v3 1/1] xen: move TLB-flush filtering out into populate_physmap during vm creation"):
> On Thu, Sep 08, 2016 at 01:01:40PM +0200, Dario Faggioli wrote:
> > On Thu, 2016-09-08 at 11:50 +0100, Wei Liu wrote:
> > > On Thu, Sep 08, 2016 at 01:30:03PM +0800, Dongli Zhang wrote:
> > > > +    if ( next->domain->already_scheduled == 0 )
> > > > +        next->domain->already_scheduled = 1;
> > > > +
> > > Can be simplified by omitting the "if" altogether.  
> > >
> > Are you sure? I mean looking at the cases when the flag is already true
> > (which means, during the life of a domain, basically **always** except
> > a handful of instances after creation), what costs less, a check that
> > is always false, or a write that is always updating a value with its
> > current value?
> 
> Omitting the check certain results in less instructions. And it would
> probably eliminate misses in instruction cache and branch prediction
> logic in the processor.
> 
> In the grand scheme of things, this is a rather minor optimisation, so I
> wouldn't argue strongly for this.

Are we sure we ought to be discussing this in terms of optimisation ?
I doubt it makes any significant difference either way.

But there is a difference in clarity.  I would not normally expect to
see this:

   bool x;

   ...

   if (!x)
       x = 1;

If I saw that I would wonder if the programmer was confused, or
whether I was missing something.

Looking at it without the benefit of the definition of x, it looks
more like x might be a non-boolean type.

Thanks,
Ian.
Dario Faggioli Sept. 8, 2016, 2:49 p.m. UTC | #7
On Thu, 2016-09-08 at 13:19 +0200, Juergen Gross wrote:
> The first scheduling is done via unpausing the domain. Why not
> setting
> the flag to true in that path?
> 
That could be a good idea.

And in general, I'm all for finding a place and/or a state that better
represents the condition of "setting to run this vcpu for the first
time" and set the flag there, rather than re-assigning 1 to an
"already_scheduled" flag each an every time we go through schedule()
(and not for performance and optimization reason).

Which domain_unpause() (or whatever) do you have in mind precisely?

Thanks and Regards,
Dario
Andrew Cooper Sept. 8, 2016, 2:52 p.m. UTC | #8
On 08/09/16 15:49, Dario Faggioli wrote:
> On Thu, 2016-09-08 at 13:19 +0200, Juergen Gross wrote:
>> The first scheduling is done via unpausing the domain. Why not
>> setting
>> the flag to true in that path?
>>
> That could be a good idea.
>
> And in general, I'm all for finding a place and/or a state that better
> represents the condition of "setting to run this vcpu for the first
> time" and set the flag there, rather than re-assigning 1 to an
> "already_scheduled" flag each an every time we go through schedule()
> (and not for performance and optimization reason).
>
> Which domain_unpause() (or whatever) do you have in mind precisely?

Domains are created with a single systemcontroller pause refcount, which
must be undone by the use of XEN_DOMCTL_unpausedomain when construction
is complete.

That would be the appropriate place to set such a variable.

~Andrew
Jan Beulich Sept. 8, 2016, 3:36 p.m. UTC | #9
>>> On 08.09.16 at 07:30, <dongli.zhang@oracle.com> wrote:
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -474,6 +474,9 @@ struct domain
>          unsigned int guest_request_enabled       : 1;
>          unsigned int guest_request_sync          : 1;
>      } monitor;
> +
> +    /* set to 1 the first time this domain gets scheduled. */
> +    bool_t already_scheduled;

Did you go through and check that there is nothing this information
can already get derived from? I can't immediately point you at
anything, but it feels like there should. And if indeed there isn't,
then - to extend on someone else's comments (I think it was Dario)
- please use plain bool in new additions.

Jan
Dario Faggioli Sept. 8, 2016, 3:53 p.m. UTC | #10
On Thu, 2016-09-08 at 09:36 -0600, Jan Beulich wrote:
> > 
> > > 
> > > > 
> > > > On 08.09.16 at 07:30, <dongli.zhang@oracle.com> wrote:
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -474,6 +474,9 @@ struct domain
> >          unsigned int guest_request_enabled       : 1;
> >          unsigned int guest_request_sync          : 1;
> >      } monitor;
> > +
> > +    /* set to 1 the first time this domain gets scheduled. */
> > +    bool_t already_scheduled;
> Did you go through and check that there is nothing this information
> can already get derived from? I can't immediately point you at
> anything, but it feels like there should. 
>
Indeed. And if there isn't and we need to do add our own flagging,
isn't there a better way and place where to put it (e.g., what Juergen
and Andrew are hinting at)?

> And if indeed there isn't,
> then - to extend on someone else's comments (I think it was Dario)
> - please use plain bool in new additions.
> 
It's Wei that commented about bool-s use in the patch. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
diff mbox

Patch

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a8804e4..611a471 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -303,6 +303,8 @@  struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
     if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) )
         goto fail;
 
+    d->already_scheduled = 0;
+
     if ( domcr_flags & DOMCRF_hvm )
         d->guest_type = guest_type_hvm;
     else if ( domcr_flags & DOMCRF_pvh )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index f34dd56..3641469 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@  static void populate_physmap(struct memop_args *a)
     unsigned int i, j;
     xen_pfn_t gpfn, mfn;
     struct domain *d = a->domain, *curr_d = current->domain;
+    bool_t need_tlbflush = 0;
+    uint32_t tlbflush_timestamp = 0;
 
     if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
                                      a->nr_extents-1) )
@@ -150,6 +152,12 @@  static void populate_physmap(struct memop_args *a)
                             max_order(curr_d)) )
         return;
 
+    /* MEMF_no_tlbflush can be set only during vm creation phase when
+     * already_scheduled is still 0 before this domain gets scheduled for
+     * the first time. */
+    if ( d->already_scheduled == 0 )
+        a->memflags |= MEMF_no_tlbflush;
+
     for ( i = a->nr_done; i < a->nr_extents; i++ )
     {
         if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +222,21 @@  static void populate_physmap(struct memop_args *a)
                     goto out;
                 }
 
+                if ( d->already_scheduled == 0 )
+                {
+                    for ( j = 0; j < (1U << a->extent_order); j++ )
+                    {
+                        if ( page[j].u.free.need_tlbflush &&
+                             (page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
+                             (!need_tlbflush ||
+                             (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
+                        {
+                            need_tlbflush = 1;
+                            tlbflush_timestamp = page[j].tlbflush_timestamp;
+                        }
+                    }
+                }
+
                 mfn = page_to_mfn(page);
             }
 
@@ -232,6 +255,16 @@  static void populate_physmap(struct memop_args *a)
     }
 
 out:
+    if ( need_tlbflush )
+    {
+        cpumask_t mask = cpu_online_map;
+        tlbflush_filter(mask, tlbflush_timestamp);
+        if ( !cpumask_empty(&mask) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(&mask);
+        }
+    }
     a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..e0283fc 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@  static struct page_info *alloc_heap_pages(
         BUG_ON(pg[i].count_info != PGC_state_free);
         pg[i].count_info = PGC_state_inuse;
 
-        if ( pg[i].u.free.need_tlbflush &&
+        if ( !(memflags & MEMF_no_tlbflush) &&
+             pg[i].u.free.need_tlbflush &&
              (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
              (!need_tlbflush ||
               (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 32a300f..593541a 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1376,6 +1376,11 @@  static void schedule(void)
 
     next = next_slice.task;
 
+    /* Set already_scheduled to 1 when this domain gets scheduled for the
+     * first time */
+    if ( next->domain->already_scheduled == 0 )
+        next->domain->already_scheduled = 1;
+
     sd->curr = next;
 
     if ( next_slice.time >= 0 ) /* -ve means no limit */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..880ca88 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -221,6 +221,8 @@  struct npfec {
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
 #define _MEMF_no_owner    5
 #define  MEMF_no_owner    (1U<<_MEMF_no_owner)
+#define _MEMF_no_tlbflush 6
+#define  MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush)
 #define _MEMF_node        8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n)     ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 2f9c15f..cbd8329 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -474,6 +474,9 @@  struct domain
         unsigned int guest_request_enabled       : 1;
         unsigned int guest_request_sync          : 1;
     } monitor;
+
+    /* set to 1 the first time this domain gets scheduled. */
+    bool_t already_scheduled;
 };
 
 /* Protect updates/reads (resp.) of domain_list and domain_hash. */