[05/24] xen: credit2: make tickling more deterministic

Message ID 147145428827.25877.8612560340138019986.stgit@Solace.fritz.box (mailing list archive)
State New, archived

Commit Message

Dario Faggioli Aug. 17, 2016, 5:18 p.m. UTC
Right now, the following scenario can occur:
 - upon vcpu v wakeup, v itself is put in the runqueue,
   and pcpu X is tickled;
 - pcpu Y schedules (for whatever reason), sees v in
   the runqueue and picks it up.

This may seem ok (or even a good thing), but it's not.
In fact, if runq_tickle() decided X is where v should
run, it did it for a reason (load distribution, SMT
support, cache hotness, affinity, etc), and we really
should try as hard as possible to stick to that.

Of course, we can't be too strict, or we risk leaving
vcpus in the runqueue while there is available CPU
capacity. So, we only leave v in the runqueue --for X to
pick it up-- if we see that X has been tickled and
has not scheduled yet, i.e., it will have a real chance
of actually selecting and scheduling v.

If that is not the case, we schedule it on Y (or, at
least, we consider that), as running somewhere non-ideal
is better than not running at all.

The commit also adds performance counters for each of
the possible situations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/common/sched_credit2.c   |   65 +++++++++++++++++++++++++++++++++++++++---
 xen/include/xen/perfc_defn.h |    3 ++
 2 files changed, 64 insertions(+), 4 deletions(-)
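
For illustration, here is a self-contained toy model of the deferral
logic (plain C, not Xen code: the tickled_cpu field and the skip
condition mirror the patch, everything else is made up for the example):

#include <stdio.h>
#include <stdbool.h>

#define NO_CPU (-1)

/* Toy vcpu: remembers which pcpu, if any, was tickled on its behalf. */
struct toy_vcpu { int id; int tickled_cpu; };

/*
 * Should pcpu 'cpu', while scheduling, skip 'v' and leave it in the
 * runqueue? Only if v was tickled for some other pcpu, and that pcpu
 * has not gone through the scheduler yet.
 */
static bool defer_to_tickled_cpu(const struct toy_vcpu *v, int cpu,
                                 const bool tickled[])
{
    return v->tickled_cpu != NO_CPU && v->tickled_cpu != cpu &&
           tickled[v->tickled_cpu];
}

int main(void)
{
    bool tickled[4] = { false, false, true, false }; /* pcpu 2 tickled */
    struct toy_vcpu v = { .id = 7, .tickled_cpu = 2 };

    /* pcpu 1 schedules first: it leaves vcpu 7 in the runqueue... */
    printf("pcpu 1 skips vcpu %d: %s\n", v.id,
           defer_to_tickled_cpu(&v, 1, tickled) ? "yes" : "no");

    /* ...while pcpu 2, the one that was tickled, picks it up. */
    printf("pcpu 2 skips vcpu %d: %s\n", v.id,
           defer_to_tickled_cpu(&v, 2, tickled) ? "yes" : "no");

    return 0;
}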

Comments

Anshul Makkar Aug. 31, 2016, 5:10 p.m. UTC | #1
On 17/08/16 18:18, Dario Faggioli wrote:
> Right now, the following scenario can occur:
>   - upon vcpu v wakeup, v itself is put in the runqueue,
>     and pcpu X is tickled;
>   - pcpu Y schedules (for whatever reason), sees v in
>     the runqueue and picks it up.
>
> This may seem ok (or even a good thing), but it's not.
> In fact, if runq_tickle() decided X is where v should
> run, it did it for a reason (load distribution, SMT
> support, cache hotness, affinity, etc), and we really
> should try as hard as possible to stick to that.
>
> Of course, we can't be too strict, or we risk leaving
> vcpus in the runqueue while there is available CPU
> capacity. So, we only leave v in the runqueue --for X to
> pick it up-- if we see that X has been tickled and
> has not scheduled yet, i.e., it will have a real chance
> of actually selecting and scheduling v.
>
> If that is not the case, we schedule it on Y (or, at
> least, we consider that), as running somewhere non-ideal
> is better than not running at all.
>
> The commit also adds performance counters for each of
> the possible situations.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>   xen/common/sched_credit2.c   |   65 +++++++++++++++++++++++++++++++++++++++---
>   xen/include/xen/perfc_defn.h |    3 ++
>   2 files changed, 64 insertions(+), 4 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 12dfd20..a3d7beb 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -54,6 +54,7 @@
>   #define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
>   #define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
>   #define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
> +#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2, 20)
>
>   /*
>    * WARNING: This is still in an experimental phase.  Status and work can be found at the
> @@ -398,6 +399,7 @@ struct csched2_vcpu {
>       int credit;
>       s_time_t start_time; /* When we were scheduled (used for credit) */
>       unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
> +    int tickled_cpu;     /* cpu tickled for picking us up (-1 if none) */
>
>       /* Individual contribution to load */
>       s_time_t load_last_update;  /* Last time average was updated */
> @@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>       __cpumask_set_cpu(ipid, &rqd->tickled);
>       smt_idle_mask_clear(ipid, &rqd->smt_idle);
>       cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> +
> +    if ( unlikely(new->tickled_cpu != -1) )
> +        SCHED_STAT_CRANK(tickled_cpu_overwritten);
> +    new->tickled_cpu = ipid;
>   }
>
>   /*
> @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>           ASSERT(svc->sdom != NULL);
>           svc->credit = CSCHED2_CREDIT_INIT;
>           svc->weight = svc->sdom->weight;
> +        svc->tickled_cpu = -1;
>           /* Starting load of 50% */
>           svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
>           svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
> @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>       else
>       {
>           ASSERT(svc->sdom == NULL);
> +        svc->tickled_cpu = svc->vcpu->vcpu_id;
If I understood correctly, tickled_cpu refers to pcpu and not a vcpu. 
Saving vcpu_id in tickled_cpu looks wrong.

>           svc->credit = CSCHED2_IDLE_CREDIT;
>           svc->weight = 0;
>       }
> @@ -2233,7 +2241,8 @@ void __dump_execstate(void *unused);
>   static struct csched2_vcpu *
>   runq_candidate(struct csched2_runqueue_data *rqd,
>                  struct csched2_vcpu *scurr,
> -               int cpu, s_time_t now)
> +               int cpu, s_time_t now,
> +               unsigned int *pos)
>   {
>       struct list_head *iter;
>       struct csched2_vcpu *snext = NULL;
> @@ -2262,13 +2271,29 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>
>           /* Only consider vcpus that are allowed to run on this processor. */
>           if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
> +        {
> +            (*pos)++;
>               continue;
> +        }
> +
> +        /*
> +         * If a vcpu is meant to be picked up by another processor, and
> +         * that processor has not scheduled yet, leave it in the runqueue.
> +         */
> +        if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
> +             cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) )
> +        {
> +            (*pos)++;
> +            SCHED_STAT_CRANK(deferred_to_tickled_cpu);
> +            continue;
> +        }
>
>           /* If this is on a different processor, don't pull it unless
>            * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
>           if ( svc->vcpu->processor != cpu
>                && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit )
>           {
> +            (*pos)++;
>               SCHED_STAT_CRANK(migrate_resisted);
>               continue;
>           }
> @@ -2280,9 +2305,26 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>
>           /* In any case, if we got this far, break. */
>           break;
> +    }
>
> +    if ( unlikely(tb_init_done) )
> +    {
> +        struct {
> +            unsigned vcpu:16, dom:16;
> +            unsigned tickled_cpu, position;
> +        } d;
> +        d.dom = snext->vcpu->domain->domain_id;
> +        d.vcpu = snext->vcpu->vcpu_id;
> +        d.tickled_cpu = snext->tickled_cpu;
> +        d.position = *pos;
> +        __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>       }
>
> +    if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
> +        SCHED_STAT_CRANK(tickled_cpu_overridden);
> +
>       return snext;
>   }
>
> @@ -2298,6 +2340,7 @@ csched2_schedule(
>       struct csched2_runqueue_data *rqd;
>       struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
>       struct csched2_vcpu *snext = NULL;
> +    unsigned int snext_pos = 0;
>       struct task_slice ret;
>
>       SCHED_STAT_CRANK(schedule);
> @@ -2347,7 +2390,7 @@ csched2_schedule(
>           snext = CSCHED2_VCPU(idle_vcpu[cpu]);
>       }
>       else
> -        snext=runq_candidate(rqd, scurr, cpu, now);
> +        snext = runq_candidate(rqd, scurr, cpu, now, &snext_pos);
>
>       /* If switching from a non-idle runnable vcpu, put it
>        * back on the runqueue. */
> @@ -2371,8 +2414,21 @@ csched2_schedule(
>               __set_bit(__CSFLAG_scheduled, &snext->flags);
>           }
>
> -        /* Check for the reset condition */
> -        if ( snext->credit <= CSCHED2_CREDIT_RESET )
> +        /*
> +         * The reset condition is "has a scheduler epoch come to an end?".
> +         * The way this is enforced is checking whether the vcpu at the top
> +         * of the runqueue has negative credits. This means the epochs have
> +         * variable length, as in one epoch expires when:
> +         *  1) the vcpu at the top of the runqueue has executed for
> +         *     around 10 ms (with default parameters);
> +         *  2) no other vcpu with higher credits wants to run.
> +         *
> +         * Here, where we want to check for reset, we need to make sure the
> +         * proper vcpu is being used. In fact, runq_candidate() may have
> +         * not returned the first vcpu in the runqueue, for various reasons
> +         * (e.g., affinity). Only trigger a reset when it does.
> +         */
> +        if ( snext_pos == 0 && snext->credit <= CSCHED2_CREDIT_RESET )
>           {
>               reset_credit(ops, cpu, now, snext);
>               balance_load(ops, cpu, now);
> @@ -2386,6 +2442,7 @@ csched2_schedule(
>           }
>
>           snext->start_time = now;
> +        snext->tickled_cpu = -1;
>
>           /* Safe because lock for old processor is held */
>           if ( snext->vcpu->processor != cpu )
> diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
> index a336c71..4a835b8 100644
> --- a/xen/include/xen/perfc_defn.h
> +++ b/xen/include/xen/perfc_defn.h
> @@ -66,6 +66,9 @@ PERFCOUNTER(runtime_max_timer,      "csched2: runtime_max_timer")
>   PERFCOUNTER(migrated,               "csched2: migrated")
>   PERFCOUNTER(migrate_resisted,       "csched2: migrate_resisted")
>   PERFCOUNTER(credit_reset,           "csched2: credit_reset")
> +PERFCOUNTER(deferred_to_tickled_cpu,"csched2: deferred_to_tickled_cpu")
> +PERFCOUNTER(tickled_cpu_overwritten,"csched2: tickled_cpu_overwritten")
> +PERFCOUNTER(tickled_cpu_overridden, "csched2: tickled_cpu_overridden")
>
>   PERFCOUNTER(need_flush_tlb_flush,   "PG_need_flush tlb flushes")
>

Anshul
Dario Faggioli Sept. 5, 2016, 1:47 p.m. UTC | #2
On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote:
> On 17/08/16 18:18, Dario Faggioli wrote:
> > 
> > Right now, the following scenario can occur:
> >   - upon vcpu v wakeup, v itself is put in the runqueue,
> >     and pcpu X is tickled;
> >   - pcpu Y schedules (for whatever reason), sees v in
> >     the runqueue and picks it up.
> > 
> > This may seem ok (or even a good thing), but it's not.
> > In fact, if runq_tickle() decided X is where v should
> > run, it did it for a reason (load distribution, SMT
> > support, cache hotness, affinity, etc), and we really
> > should try as hard as possible to stick to that.
> > 
> > Of course, we can't be too strict, or we risk leaving
> > vcpus in the runqueue while there is available CPU
> > capacity. So, we only leave v in the runqueue --for X to
> > pick it up-- if we see that X has been tickled and
> > has not scheduled yet, i.e., it will have a real chance
> > of actually selecting and scheduling v.
> > 
> > If that is not the case, we schedule it on Y (or, at
> > least, we consider that), as running somewhere non-ideal
> > is better than not running at all.
> > 
> > The commit also adds performance counters for each of
> > the possible situations.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@citrix.com>
> > Cc: Anshul Makkar <anshul.makkar@citrix.com>
> > Cc: Jan Beulich <JBeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> >   xen/common/sched_credit2.c   |   65
> > +++++++++++++++++++++++++++++++++++++++---
> >   xen/include/xen/perfc_defn.h |    3 ++
> >   2 files changed, 64 insertions(+), 4 deletions(-)
> > 
> > diff --git a/xen/common/sched_credit2.c
> > b/xen/common/sched_credit2.c
> > index 12dfd20..a3d7beb 100644
> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c
> > @@ -54,6 +54,7 @@
> >   #define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2,
> > 16)
> >   #define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2,
> > 17)
> >   #define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2,
> > 19)
> > +#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2,
> > 20)
> > 
> >   /*
> >    * WARNING: This is still in an experimental phase.  Status and
> > work can be found at the
> > @@ -398,6 +399,7 @@ struct csched2_vcpu {
> >       int credit;
> >       s_time_t start_time; /* When we were scheduled (used for
> > credit) */
> >       unsigned flags;      /* 16 bits doesn't seem to play well
> > with clear_bit() */
> > +    int tickled_cpu;     /* cpu tickled for picking us up (-1 if
> > none) */
> > 
> >       /* Individual contribution to load */
> >       s_time_t load_last_update;  /* Last time average was updated
> > */
> > @@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops,
> > struct csched2_vcpu *new, s_time_t now)
> >       __cpumask_set_cpu(ipid, &rqd->tickled);
> >       smt_idle_mask_clear(ipid, &rqd->smt_idle);
> >       cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> > +
> > +    if ( unlikely(new->tickled_cpu != -1) )
> > +        SCHED_STAT_CRANK(tickled_cpu_overwritten);
> > +    new->tickled_cpu = ipid;
> >   }
> > 
> >   /*
> > @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler
> > *ops, struct vcpu *vc, void *dd)
> >           ASSERT(svc->sdom != NULL);
> >           svc->credit = CSCHED2_CREDIT_INIT;
> >           svc->weight = svc->sdom->weight;
> > +        svc->tickled_cpu = -1;
> >           /* Starting load of 50% */
> >           svc->avgload = 1ULL << (CSCHED2_PRIV(ops)-
> > >load_precision_shift - 1);
> >           svc->load_last_update = NOW() >>
> > LOADAVG_GRANULARITY_SHIFT;
> > @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler
> > *ops, struct vcpu *vc, void *dd)
> >       else
> >       {
> >           ASSERT(svc->sdom == NULL);
> > +        svc->tickled_cpu = svc->vcpu->vcpu_id;
> If I understood correctly, tickled_cpu refers to pcpu and not a
> vcpu. 
> Saving vcpu_id in tickled_cpu looks wrong.
> 
Yes, and in fact, as you can see in the previous hunk, for pretty much
all vcpus, tickled_cpu is initialized to -1.

Here, we are dealing with the vcpus of the idle domain. And for vcpus
of the idle domain, their vcpu id is the same as the id of the pcpu
they're associated to.

I agree it looks a little bit weird, but it's both correct, and the
easiest and cleanest way for initializing this.

I guess I can add a comment...
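
Something along these lines, perhaps (just a sketch of the hunk with such
a comment added, not the final code):

    else
    {
        ASSERT(svc->sdom == NULL);
        /*
         * tickled_cpu holds a pcpu id. For the vcpus of the idle
         * domain, though, the vcpu id and the id of the pcpu each
         * of them is associated to are, by construction, the same.
         */
        svc->tickled_cpu = svc->vcpu->vcpu_id;
        svc->credit = CSCHED2_IDLE_CREDIT;
        svc->weight = 0;
    }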

Thanks and Regards,
Dario
Anshul Makkar Sept. 7, 2016, 12:25 p.m. UTC | #3
On 05/09/16 14:47, Dario Faggioli wrote:
> On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote:
>> On 17/08/16 18:18, Dario Faggioli wrote:
>>>
>>> @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler
>>> *ops, struct vcpu *vc, void *dd)
>>>            ASSERT(svc->sdom != NULL);
>>>            svc->credit = CSCHED2_CREDIT_INIT;
>>>            svc->weight = svc->sdom->weight;
>>> +        svc->tickled_cpu = -1;
>>>            /* Starting load of 50% */
>>>            svc->avgload = 1ULL << (CSCHED2_PRIV(ops)-
>>>> load_precision_shift - 1);
>>>            svc->load_last_update = NOW() >>
>>> LOADAVG_GRANULARITY_SHIFT;
>>> @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler
>>> *ops, struct vcpu *vc, void *dd)
>>>        else
>>>        {
>>>            ASSERT(svc->sdom == NULL);
>>> +        svc->tickled_cpu = svc->vcpu->vcpu_id;
>> If I understood correctly, tickled_cpu refers to pcpu and not a
>> vcpu.
>> Saving vcpu_id in tickled_cpu looks wrong.
>>
> Yes, and in fact, as you can see in the previous hunk, for pretty much
> all vcpus, tickled_cpu is initialized to -1.
>
> Here, we are dealing with the vcpus of the idle domain. And for vcpus
> of the idle domain, their vcpu id is the same as the id of the pcpu
> they're associated to.
Ah, that makes it clear.
>
> I agree it looks a little bit weird, but it's both correct, and the
> easiest and cleanest way for initializing this.
>
> I guess I can add a comment...
That will be useful.
>
> Thanks and Regards,
> Dario
>
Thanks
Anshul
George Dunlap Sept. 13, 2016, 11:13 a.m. UTC | #4
On 05/09/16 14:47, Dario Faggioli wrote:
> On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote:
>> On 17/08/16 18:18, Dario Faggioli wrote:
>>>
>>> Right now, the following scenario can occur:
>>>   - upon vcpu v wakeup, v itself is put in the runqueue,
>>>     and pcpu X is tickled;
>>>   - pcpu Y schedules (for whatever reason), sees v in
>>>     the runqueue and picks it up.
>>>
>>> This may seem ok (or even a good thing), but it's not.
>>> In fact, if runq_tickle() decided X is where v should
>>> run, it did it for a reason (load distribution, SMT
>>> support, cache hotness, affinity, etc), and we really
>>> should try as hard as possible to stick to that.
>>>
>>> Of course, we can't be too strict, or we risk leaving
>>> vcpus in the runqueue while there is available CPU
>>> capacity. So, we only leave v in the runqueue --for X to
>>> pick it up-- if we see that X has been tickled and
>>> has not scheduled yet, i.e., it will have a real chance
>>> of actually selecting and scheduling v.
>>>
>>> If that is not the case, we schedule it on Y (or, at
>>> least, we consider that), as running somewhere non-ideal
>>> is better than not running at all.
>>>
>>> The commit also adds performance counters for each of
>>> the possible situations.
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> ---
>>> Cc: George Dunlap <george.dunlap@citrix.com>
>>> Cc: Anshul Makkar <anshul.makkar@citrix.com>
>>> Cc: Jan Beulich <JBeulich@suse.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>>   xen/common/sched_credit2.c   |   65
>>> +++++++++++++++++++++++++++++++++++++++---
>>>   xen/include/xen/perfc_defn.h |    3 ++
>>>   2 files changed, 64 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/xen/common/sched_credit2.c
>>> b/xen/common/sched_credit2.c
>>> index 12dfd20..a3d7beb 100644
>>> --- a/xen/common/sched_credit2.c
>>> +++ b/xen/common/sched_credit2.c
>>> @@ -54,6 +54,7 @@
>>>   #define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2,
>>> 16)
>>>   #define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2,
>>> 17)
>>>   #define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2,
>>> 19)
>>> +#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2,
>>> 20)
>>>
>>>   /*
>>>    * WARNING: This is still in an experimental phase.  Status and
>>> work can be found at the
>>> @@ -398,6 +399,7 @@ struct csched2_vcpu {
>>>       int credit;
>>>       s_time_t start_time; /* When we were scheduled (used for
>>> credit) */
>>>       unsigned flags;      /* 16 bits doesn't seem to play well
>>> with clear_bit() */
>>> +    int tickled_cpu;     /* cpu tickled for picking us up (-1 if
>>> none) */
>>>
>>>       /* Individual contribution to load */
>>>       s_time_t load_last_update;  /* Last time average was updated
>>> */
>>> @@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops,
>>> struct csched2_vcpu *new, s_time_t now)
>>>       __cpumask_set_cpu(ipid, &rqd->tickled);
>>>       smt_idle_mask_clear(ipid, &rqd->smt_idle);
>>>       cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>>> +
>>> +    if ( unlikely(new->tickled_cpu != -1) )
>>> +        SCHED_STAT_CRANK(tickled_cpu_overwritten);
>>> +    new->tickled_cpu = ipid;
>>>   }
>>>
>>>   /*
>>> @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler
>>> *ops, struct vcpu *vc, void *dd)
>>>           ASSERT(svc->sdom != NULL);
>>>           svc->credit = CSCHED2_CREDIT_INIT;
>>>           svc->weight = svc->sdom->weight;
>>> +        svc->tickled_cpu = -1;
>>>           /* Starting load of 50% */
>>>           svc->avgload = 1ULL << (CSCHED2_PRIV(ops)-
>>>> load_precision_shift - 1);
>>>           svc->load_last_update = NOW() >>
>>> LOADAVG_GRANULARITY_SHIFT;
>>> @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler
>>> *ops, struct vcpu *vc, void *dd)
>>>       else
>>>       {
>>>           ASSERT(svc->sdom == NULL);
>>> +        svc->tickled_cpu = svc->vcpu->vcpu_id;
>> If I understood correctly, tickled_cpu refers to pcpu and not a
>> vcpu. 
>> Saving vcpu_id in tickled_cpu looks wrong.
>>
> Yes, and in fact, as you can see in the previous hunk, for pretty much
> all vcpus, tickled_cpu is initialized to -1.
> 
> Here, we are dealing with the vcpus of the idle domain. And for vcpus
> of the idle domain, their vcpu id is the same as the id of the pcpu
> they're associated to.

But what I haven't sussed out yet is why we need to initialize this for
the idle domain at all.  What benefit does it give you, and what effect
does it have?

 -George
George Dunlap Sept. 13, 2016, 11:28 a.m. UTC | #5
On 17/08/16 18:18, Dario Faggioli wrote:
> Right now, the following scenario can occur:
>  - upon vcpu v wakeup, v itself is put in the runqueue,
>    and pcpu X is tickled;
>  - pcpu Y schedules (for whatever reason), sees v in
>    the runqueue and picks it up.
> 
> This may seem ok (or even a good thing), but it's not.
> In fact, if runq_tickle() decided X is where v should
> run, it did it for a reason (load distribution, SMT
> support, cache hotness, affinity, etc), and we really
> should try as hard as possible to stick to that.
> 
> Of course, we can't be too strict, or we risk leaving
> vcpus in the runqueue while there is available CPU
> capacity. So, we only leave v in the runqueue --for X to
> pick it up-- if we see that X has been tickled and
> has not scheduled yet, i.e., it will have a real chance
> of actually selecting and scheduling v.
> 
> If that is not the case, we schedule it on Y (or, at
> least, we consider that), as running somewhere non-ideal
> is better than not running at all.
> 
> The commit also adds performance counters for each of
> the possible situations.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/common/sched_credit2.c   |   65 +++++++++++++++++++++++++++++++++++++++---
>  xen/include/xen/perfc_defn.h |    3 ++
>  2 files changed, 64 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 12dfd20..a3d7beb 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -54,6 +54,7 @@
>  #define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
>  #define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
>  #define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
> +#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2, 20)
>  
>  /*
>   * WARNING: This is still in an experimental phase.  Status and work can be found at the
> @@ -398,6 +399,7 @@ struct csched2_vcpu {
>      int credit;
>      s_time_t start_time; /* When we were scheduled (used for credit) */
>      unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
> +    int tickled_cpu;     /* cpu tickled for picking us up (-1 if none) */
>  
>      /* Individual contribution to load */
>      s_time_t load_last_update;  /* Last time average was updated */
> @@ -1049,6 +1051,10 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>      __cpumask_set_cpu(ipid, &rqd->tickled);
>      smt_idle_mask_clear(ipid, &rqd->smt_idle);
>      cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> +
> +    if ( unlikely(new->tickled_cpu != -1) )
> +        SCHED_STAT_CRANK(tickled_cpu_overwritten);
> +    new->tickled_cpu = ipid;
>  }
>  
>  /*
> @@ -1266,6 +1272,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>          ASSERT(svc->sdom != NULL);
>          svc->credit = CSCHED2_CREDIT_INIT;
>          svc->weight = svc->sdom->weight;
> +        svc->tickled_cpu = -1;
>          /* Starting load of 50% */
>          svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
>          svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
> @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>      else
>      {
>          ASSERT(svc->sdom == NULL);
> +        svc->tickled_cpu = svc->vcpu->vcpu_id;
>          svc->credit = CSCHED2_IDLE_CREDIT;
>          svc->weight = 0;
>      }
> @@ -2233,7 +2241,8 @@ void __dump_execstate(void *unused);
>  static struct csched2_vcpu *
>  runq_candidate(struct csched2_runqueue_data *rqd,
>                 struct csched2_vcpu *scurr,
> -               int cpu, s_time_t now)
> +               int cpu, s_time_t now,
> +               unsigned int *pos)

I think I'd prefer if this were called "skipped" or something like that
-- to indicate how many vcpus in the runqueue had been skipped before
coming to this one.

>  {
>      struct list_head *iter;
>      struct csched2_vcpu *snext = NULL;
> @@ -2262,13 +2271,29 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>  
>          /* Only consider vcpus that are allowed to run on this processor. */
>          if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
> +        {
> +            (*pos)++;
>              continue;
> +        }
> +
> +        /*
> +         * If a vcpu is meant to be picked up by another processor, and
> +         * that processor has not scheduled yet, leave it in the runqueue.
> +         */
> +        if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
> +             cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) )
> +        {
> +            (*pos)++;
> +            SCHED_STAT_CRANK(deferred_to_tickled_cpu);
> +            continue;
> +        }
>  
>          /* If this is on a different processor, don't pull it unless
>           * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
>          if ( svc->vcpu->processor != cpu
>               && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit )
>          {
> +            (*pos)++;
>              SCHED_STAT_CRANK(migrate_resisted);
>              continue;
>          }
> @@ -2280,9 +2305,26 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>  
>          /* In any case, if we got this far, break. */
>          break;
> +    }
>  
> +    if ( unlikely(tb_init_done) )
> +    {
> +        struct {
> +            unsigned vcpu:16, dom:16;
> +            unsigned tickled_cpu, position;
> +        } d;
> +        d.dom = snext->vcpu->domain->domain_id;
> +        d.vcpu = snext->vcpu->vcpu_id;
> +        d.tickled_cpu = snext->tickled_cpu;
> +        d.position = *pos;
> +        __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1,
> +                    sizeof(d),
> +                    (unsigned char *)&d);
>      }
>  
> +    if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
> +        SCHED_STAT_CRANK(tickled_cpu_overridden);
> +
>      return snext;
>  }
>  
> @@ -2298,6 +2340,7 @@ csched2_schedule(
>      struct csched2_runqueue_data *rqd;
>      struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
>      struct csched2_vcpu *snext = NULL;
> +    unsigned int snext_pos = 0;
>      struct task_slice ret;
>  
>      SCHED_STAT_CRANK(schedule);
> @@ -2347,7 +2390,7 @@ csched2_schedule(
>          snext = CSCHED2_VCPU(idle_vcpu[cpu]);
>      }
>      else
> -        snext=runq_candidate(rqd, scurr, cpu, now);
> +        snext = runq_candidate(rqd, scurr, cpu, now, &snext_pos);
>  
>      /* If switching from a non-idle runnable vcpu, put it
>       * back on the runqueue. */
> @@ -2371,8 +2414,21 @@ csched2_schedule(
>              __set_bit(__CSFLAG_scheduled, &snext->flags);
>          }
>  
> -        /* Check for the reset condition */
> -        if ( snext->credit <= CSCHED2_CREDIT_RESET )
> +        /*
> +         * The reset condition is "has a scheduler epoch come to an end?".
> +         * The way this is enforced is checking whether the vcpu at the top
> +         * of the runqueue has negative credits. This means the epochs have
> +         * variable length, as in one epoch expires when:
> +         *  1) the vcpu at the top of the runqueue has executed for
> +         *     around 10 ms (with default parameters);
> +         *  2) no other vcpu with higher credits wants to run.
> +         *
> +         * Here, where we want to check for reset, we need to make sure the
> +         * proper vcpu is being used. In fact, runq_candidate() may have
> +         * not returned the first vcpu in the runqueue, for various reasons
> +         * (e.g., affinity). Only trigger a reset when it does.
> +         */
> +        if ( snext_pos == 0 && snext->credit <= CSCHED2_CREDIT_RESET )

This bit wasn't mentioned in the description. :-)

There's a certain amount of sense to the idea here, but it's the kind of
thing that may have strange side effects.  Did you look at traces before
and after this change?  And does the behavior seem more rational?

If so, I'm happy to trust your judgement -- just want to check to make
sure. :-)

Everything else looks good, thanks.

 -George
Dario Faggioli Sept. 29, 2016, 3:24 p.m. UTC | #6
On Tue, 2016-09-13 at 12:13 +0100, George Dunlap wrote:
> On 05/09/16 14:47, Dario Faggioli wrote:
> > On Wed, 2016-08-31 at 18:10 +0100, anshul makkar wrote:
> > > > @@ -1273,6 +1280,7 @@ csched2_alloc_vdata(const struct
> > > > scheduler
> > > > *ops, struct vcpu *vc, void *dd)
> > > >       else
> > > >       {
> > > >           ASSERT(svc->sdom == NULL);
> > > > +        svc->tickled_cpu = svc->vcpu->vcpu_id;
> > > If I understood correctly, tickled_cpu refers to pcpu and not a
> > > vcpu. 
> > > Saving vcpu_id in tickled_cpu looks wrong.
> > > 
> > Yes, and in fact, as you can see in the previous hunk, for pretty
> > much
> > all vcpus, tickled_cpu is initialized to -1.
> > 
> > Here, we are dealing with the vcpus of the idle domain. And for
> > vcpus
> > of the idle domain, their vcpu id is the same as the id of the pcpu
> > they're associated to.
> 
> But what I haven't sussed out yet is why we need to initialize this
> for
> the idle domain at all.  What benefit does it give you, and what
> effect
> does it have?
> 
It makes things more consistent and uniform, one effect being that it
is easier to manage the performance counter that records the number of
times this 'direct tickling' mechanism has been overridden.

In fact, I've tried it and, AFAICR, things looked worse when I was not
initializing the field for idle vcpus.

static struct csched2_vcpu *
runq_candidate(struct csched2_runqueue_data *rqd,
               struct csched2_vcpu *scurr,
               int cpu, s_time_t now,
               unsigned int *pos)
{
    ...
    [1]
    if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
        SCHED_STAT_CRANK(tickled_cpu_overridden);

    return snext;
}

In fact, it is possible to reach [1] with snext being the idle vcpu,
in which case, if tickled_cpu were 0 for all of them (which would be
the case, I think, if I didn't init it), the counter would be
incremented in a rather unpredictable way.

That being said, initializing to -1 should work, and it's easier to
read and understand (as I won't be special-casing idle vcpus). So I'd
go for it (and test it, of course). What do you think?
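
FTR, the -1 variant would be something along these lines (just a
sketch, to be tested):

    else
    {
        ASSERT(svc->sdom == NULL);
        /* Idle vcpus are never tickled on anyone's behalf. */
        svc->tickled_cpu = -1;
        svc->credit = CSCHED2_IDLE_CREDIT;
        svc->weight = 0;
    }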

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli Sept. 30, 2016, 2:22 a.m. UTC | #7
On Tue, 2016-09-13 at 12:28 +0100, George Dunlap wrote:
> On 17/08/16 18:18, Dario Faggioli wrote:
> > 
> diff --git a/xen/common/sched_credit2.c
> > @@ -2233,7 +2241,8 @@ void __dump_execstate(void *unused);
> >  static struct csched2_vcpu *
> >  runq_candidate(struct csched2_runqueue_data *rqd,
> >                 struct csched2_vcpu *scurr,
> > -               int cpu, s_time_t now)
> > +               int cpu, s_time_t now,
> > +               unsigned int *pos)
> 
> I think I'd prefer if this were called "skipped" or something like
> that
> -- to indicate how many vcpus in the runqueue had been skipped before
> coming to this one.
> 
Done.

> > @@ -2298,6 +2340,7 @@ csched2_schedule(
> >      struct csched2_runqueue_data *rqd;
> >      struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
> >      struct csched2_vcpu *snext = NULL;
> > +    unsigned int snext_pos = 0;
> >      struct task_slice ret;
> >  
> >      SCHED_STAT_CRANK(schedule);
> > @@ -2347,7 +2390,7 @@ csched2_schedule(
> >          snext = CSCHED2_VCPU(idle_vcpu[cpu]);
> >      }
> >      else
> > -        snext=runq_candidate(rqd, scurr, cpu, now);
> > +        snext = runq_candidate(rqd, scurr, cpu, now, &snext_pos);
> >  
> >      /* If switching from a non-idle runnable vcpu, put it
> >       * back on the runqueue. */
> > @@ -2371,8 +2414,21 @@ csched2_schedule(
> >              __set_bit(__CSFLAG_scheduled, &snext->flags);
> >          }
> >  
> > -        /* Check for the reset condition */
> > -        if ( snext->credit <= CSCHED2_CREDIT_RESET )
> > +        /*
> > +         * The reset condition is "has a scheduler epoch come to
> > an end?".
> > +         * The way this is enforced is checking whether the vcpu
> > at the top
> > +         * of the runqueue has negative credits. This means the
> > epochs have
> > +         * variable length, as in one epoch expires when:
> > +         *  1) the vcpu at the top of the runqueue has executed
> > for
> > +         *     around 10 ms (with default parameters);
> > +         *  2) no other vcpu with higher credits wants to run.
> > +         *
> > +         * Here, where we want to check for reset, we need to make
> > sure the
> > +         * proper vcpu is being used. In fact,
> > runq_candidate() may have
> > +         * not returned the first vcpu in the runqueue, for
> > various reasons
> > +         * (e.g., affinity). Only trigger a reset when it does.
> > +         */
> > +        if ( snext_pos == 0 && snext->credit <=
> > CSCHED2_CREDIT_RESET )
> 
> This bit wasn't mentioned in the description. :-)
> 
You're right. Actually, I think this change deserves to be in its own
patch, so in v2 I'm splitting this patch in two.

> There's a certain amount of sense to the idea here, but it's the kind
> of
> thing that may have strange side effects.  Did you look at traces
> before
> and after this change?  And does the behavior seem more rational?
> 
I have. It's not that we were resetting upon the wrong vcpu very often,
but I did indeed catch a couple of examples.

And yes, the trace looked more 'regular' with this patch. Or, IOW,
without this patch, some of the reset events were suspiciously close
to each other.

TBH, in the vast majority of cases, even when a "spurious reset" was
involved, the difference was rather hard to tell. But please consider
that the combination of hard-affinity, this patch and soft-affinity
can potentially make things much worse (and in fact, I saw the most
severe occurrences when using hard-affinity).

It's also rather hard to measure the effect, but I think what is
implemented here is the right thing to do. Even if the performance
impact is hard to quantify, I claim that this is a 'correctness' issue,
or at least a matter of adhering as closely as possible to the theory
and idea behind the algorithm.

> If so, I'm happy to trust your judgement -- just want to check to
> make
> sure. :-)
> 
Ah, thanks. :-)

Regards,
Dario

Patch

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 12dfd20..a3d7beb 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -54,6 +54,7 @@ 
 #define TRC_CSCHED2_LOAD_CHECK       TRC_SCHED_CLASS_EVT(CSCHED2, 16)
 #define TRC_CSCHED2_LOAD_BALANCE     TRC_SCHED_CLASS_EVT(CSCHED2, 17)
 #define TRC_CSCHED2_PICKED_CPU       TRC_SCHED_CLASS_EVT(CSCHED2, 19)
+#define TRC_CSCHED2_RUNQ_CANDIDATE   TRC_SCHED_CLASS_EVT(CSCHED2, 20)
 
 /*
  * WARNING: This is still in an experimental phase.  Status and work can be found at the
@@ -398,6 +399,7 @@  struct csched2_vcpu {
     int credit;
     s_time_t start_time; /* When we were scheduled (used for credit) */
     unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
+    int tickled_cpu;     /* cpu tickled for picking us up (-1 if none) */
 
     /* Individual contribution to load */
     s_time_t load_last_update;  /* Last time average was updated */
@@ -1049,6 +1051,10 @@  runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
     __cpumask_set_cpu(ipid, &rqd->tickled);
     smt_idle_mask_clear(ipid, &rqd->smt_idle);
     cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+
+    if ( unlikely(new->tickled_cpu != -1) )
+        SCHED_STAT_CRANK(tickled_cpu_overwritten);
+    new->tickled_cpu = ipid;
 }
 
 /*
@@ -1266,6 +1272,7 @@  csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         ASSERT(svc->sdom != NULL);
         svc->credit = CSCHED2_CREDIT_INIT;
         svc->weight = svc->sdom->weight;
+        svc->tickled_cpu = -1;
         /* Starting load of 50% */
         svc->avgload = 1ULL << (CSCHED2_PRIV(ops)->load_precision_shift - 1);
         svc->load_last_update = NOW() >> LOADAVG_GRANULARITY_SHIFT;
@@ -1273,6 +1280,7 @@  csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
     else
     {
         ASSERT(svc->sdom == NULL);
+        svc->tickled_cpu = svc->vcpu->vcpu_id;
         svc->credit = CSCHED2_IDLE_CREDIT;
         svc->weight = 0;
     }
@@ -2233,7 +2241,8 @@  void __dump_execstate(void *unused);
 static struct csched2_vcpu *
 runq_candidate(struct csched2_runqueue_data *rqd,
                struct csched2_vcpu *scurr,
-               int cpu, s_time_t now)
+               int cpu, s_time_t now,
+               unsigned int *pos)
 {
     struct list_head *iter;
     struct csched2_vcpu *snext = NULL;
@@ -2262,13 +2271,29 @@  runq_candidate(struct csched2_runqueue_data *rqd,
 
         /* Only consider vcpus that are allowed to run on this processor. */
         if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+        {
+            (*pos)++;
             continue;
+        }
+
+        /*
+         * If a vcpu is meant to be picked up by another processor, and
+         * that processor has not scheduled yet, leave it in the runqueue.
+         */
+        if ( svc->tickled_cpu != -1 && svc->tickled_cpu != cpu &&
+             cpumask_test_cpu(svc->tickled_cpu, &rqd->tickled) )
+        {
+            (*pos)++;
+            SCHED_STAT_CRANK(deferred_to_tickled_cpu);
+            continue;
+        }
 
         /* If this is on a different processor, don't pull it unless
          * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
         if ( svc->vcpu->processor != cpu
              && snext->credit + CSCHED2_MIGRATE_RESIST > svc->credit )
         {
+            (*pos)++;
             SCHED_STAT_CRANK(migrate_resisted);
             continue;
         }
@@ -2280,9 +2305,26 @@  runq_candidate(struct csched2_runqueue_data *rqd,
 
         /* In any case, if we got this far, break. */
         break;
+    }
 
+    if ( unlikely(tb_init_done) )
+    {
+        struct {
+            unsigned vcpu:16, dom:16;
+            unsigned tickled_cpu, position;
+        } d;
+        d.dom = snext->vcpu->domain->domain_id;
+        d.vcpu = snext->vcpu->vcpu_id;
+        d.tickled_cpu = snext->tickled_cpu;
+        d.position = *pos;
+        __trace_var(TRC_CSCHED2_RUNQ_CANDIDATE, 1,
+                    sizeof(d),
+                    (unsigned char *)&d);
     }
 
+    if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
+        SCHED_STAT_CRANK(tickled_cpu_overridden);
+
     return snext;
 }
 
@@ -2298,6 +2340,7 @@  csched2_schedule(
     struct csched2_runqueue_data *rqd;
     struct csched2_vcpu * const scurr = CSCHED2_VCPU(current);
     struct csched2_vcpu *snext = NULL;
+    unsigned int snext_pos = 0;
     struct task_slice ret;
 
     SCHED_STAT_CRANK(schedule);
@@ -2347,7 +2390,7 @@  csched2_schedule(
         snext = CSCHED2_VCPU(idle_vcpu[cpu]);
     }
     else
-        snext=runq_candidate(rqd, scurr, cpu, now);
+        snext = runq_candidate(rqd, scurr, cpu, now, &snext_pos);
 
     /* If switching from a non-idle runnable vcpu, put it
      * back on the runqueue. */
@@ -2371,8 +2414,21 @@  csched2_schedule(
             __set_bit(__CSFLAG_scheduled, &snext->flags);
         }
 
-        /* Check for the reset condition */
-        if ( snext->credit <= CSCHED2_CREDIT_RESET )
+        /*
+         * The reset condition is "has a scheduler epoch come to an end?".
+         * The way this is enforced is checking whether the vcpu at the top
+         * of the runqueue has negative credits. This means the epochs have
+         * variable length, as in one epoch expires when:
+         *  1) the vcpu at the top of the runqueue has executed for
+         *     around 10 ms (with default parameters);
+         *  2) no other vcpu with higher credits wants to run.
+         *
+         * Here, where we want to check for reset, we need to make sure the
+         * proper vcpu is being used. In fact, runq_candidate() may have
+         * not returned the first vcpu in the runqueue, for various reasons
+         * (e.g., affinity). Only trigger a reset when it does.
+         */
+        if ( snext_pos == 0 && snext->credit <= CSCHED2_CREDIT_RESET )
         {
             reset_credit(ops, cpu, now, snext);
             balance_load(ops, cpu, now);
@@ -2386,6 +2442,7 @@  csched2_schedule(
         }
 
         snext->start_time = now;
+        snext->tickled_cpu = -1;
 
         /* Safe because lock for old processor is held */
         if ( snext->vcpu->processor != cpu )
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index a336c71..4a835b8 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -66,6 +66,9 @@  PERFCOUNTER(runtime_max_timer,      "csched2: runtime_max_timer")
 PERFCOUNTER(migrated,               "csched2: migrated")
 PERFCOUNTER(migrate_resisted,       "csched2: migrate_resisted")
 PERFCOUNTER(credit_reset,           "csched2: credit_reset")
+PERFCOUNTER(deferred_to_tickled_cpu,"csched2: deferred_to_tickled_cpu")
+PERFCOUNTER(tickled_cpu_overwritten,"csched2: tickled_cpu_overwritten")
+PERFCOUNTER(tickled_cpu_overridden, "csched2: tickled_cpu_overridden")
 
 PERFCOUNTER(need_flush_tlb_flush,   "PG_need_flush tlb flushes")