
[v9] xen: sched: convert RTDS from time to event driven model

Message ID 1458000275-4237-1-git-send-email-tiche@seas.upenn.edu (mailing list archive)
State New, archived

Commit Message

Tianyang Chen March 15, 2016, 12:04 a.m. UTC
Budget replenishment and enforcement are separated by adding
a replenishment timer, which fires at the most imminent
release time among all runnable vcpus.

A replenishment queue has been added to keep track of all vcpus that
are runnable.
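The replenishment queue (like the run queue) is kept in deadline order, and the
ordered insert also reports whether the new vcpu landed at the head of the list,
since that is when the timer must be re-programmed. A minimal sketch of that
logic, using a stand-in struct rather than the real Xen types:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct rt_vcpu: only what the ordering needs. */
struct vcpu_stub {
    long cur_deadline;            /* absolute release/deadline time */
    struct vcpu_stub *next;
};

/*
 * Insert svc into a list kept in non-decreasing cur_deadline order.
 * Returns 1 if svc became the new head -- the case in which the patch
 * re-programs the replenishment timer.
 */
static int deadline_queue_insert_sketch(struct vcpu_stub **head,
                                        struct vcpu_stub *svc)
{
    struct vcpu_stub **pp = head;
    int pos = 0;

    while ( *pp != NULL && (*pp)->cur_deadline < svc->cur_deadline )
    {
        pp = &(*pp)->next;
        pos++;
    }
    svc->next = *pp;
    *pp = svc;
    return pos == 0;
}
```

The real deadline_queue_insert() in the patch walks a Xen list_head the same
way; this sketch only illustrates the ordering and the "inserted at front"
return value.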

The following functions have major changes to manage the replenishment
events and timer.

repl_handler(): a timer handler that replenishes vcpus and is
re-programmed to fire at the nearest vcpu release time.

rt_schedule(): picks the highest-priority runnable vcpu based on
cpu affinity; ret.time will be passed to schedule(). If an idle
vcpu is picked, -1 is returned to avoid busy-waiting. repl_update()
has been removed.

rt_vcpu_wake(): when a vcpu wakes up, it tickles pcpus instead of
picking a vcpu from the run queue.

rt_context_saved(): when context switching is finished, the
preempted vcpu will be put back into the runq.

Simplified functional graph:

schedule.c SCHEDULE_SOFTIRQ:
    rt_schedule():
        [spin_lock]
        burn_budget(scurr)
        snext = runq_pick()
        [spin_unlock]

sched_rt.c TIMER_SOFTIRQ
    replenishment_timer_handler()
        [spin_lock]
        <for_each_vcpu_on_q(i)> {
            replenish(i)
            runq_tickle(i)
        }>
        program_timer()
        [spin_unlock]
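The TIMER_SOFTIRQ half of the graph above can be fleshed out as a rough C
sketch (plain arrays instead of Xen's linked lists, runq_tickle() reduced to a
comment; the names and simplifications here are illustrative, not the patch's
actual code):

```c
#include <assert.h>

struct vcpu_sketch {
    long cur_deadline;   /* next release time, absolute */
    long cur_budget;     /* budget left in the current period */
    long period;
    long budget;         /* full budget per period */
};

/*
 * Replenish every vcpu whose release time has passed, then return the
 * next instant the timer should fire (the earliest remaining
 * cur_deadline), or -1 if there are no vcpus.
 */
static long repl_handler_sketch(struct vcpu_sketch *q, int n, long now)
{
    long next = -1;

    for ( int i = 0; i < n; i++ )
    {
        if ( q[i].cur_deadline <= now )
        {
            /* rt_update_deadline(): move the release time past now... */
            do
                q[i].cur_deadline += q[i].period;
            while ( q[i].cur_deadline <= now );
            /* ...and refill the budget. */
            q[i].cur_budget = q[i].budget;
            /* the real handler would runq_tickle() this vcpu here */
        }
        if ( next == -1 || q[i].cur_deadline < next )
            next = q[i].cur_deadline;
    }
    return next;   /* i.e. set_timer(repl_timer, next) */
}
```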

Signed-off-by: Tianyang Chen <tiche@seas.upenn.edu>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Dagaen Golomb <dgolomb@seas.upenn.edu>
---
changes since v8:
    Replaced spin_lock_irqsave with spin_lock_irq.
    Bug fix in burn_budget() where budget == 0.
    Removed unnecessary tickling in the timer handler when
    vcpus have the same deadline.

changes since v7:
    Coding style.
    Added a flag to indicate that a vcpu was depleted.
    Added a local list of updated vcpus in the
    timer handler, used to decide which vcpus
    should be tickled.

changes since v6:
    Removed unnecessary vcpu_on_q() checks for runq_insert()
    Renamed replenishment queue related functions without
    underscores
    New replq_reinsert() function for rt_vcpu_wake()

changes since v5:
    Removed unnecessary vcpu_on_replq() checks
    deadline_queue_insert() returns a flag to indicate if it's
    needed to re-program the timer
    Removed unnecessary timer checks
    Added helper functions to remove vcpus from queues
    coding style

Changes since v4:
    removed unnecessary replenishment queue checks in vcpu_wake()
    extended replq_remove() to all cases in vcpu_sleep()
    used _deadline_queue_insert() helper function for both queues
    _replq_insert() and _replq_remove() program timer internally

Changes since v3:
    removed running queue.
    added repl queue to keep track of repl events.
    timer is now per scheduler.
    timer is init on a valid cpu in a cpupool.
---
 xen/common/sched_rt.c |  437 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 341 insertions(+), 96 deletions(-)

Comments

Meng Xu March 16, 2016, 3:11 a.m. UTC | #1
On Mon, Mar 14, 2016 at 8:04 PM, Tianyang Chen <tiche@seas.upenn.edu> wrote:
> Budget replenishment and enforcement are separated by adding
> a replenishment timer, which fires at the next most imminent
> release time of all runnable vcpus.
>
> A replenishment queue has been added to keep track of all vcpus that
> are runnable.
>
> The following functions have major changes to manage the replenishment
> events and timer.
>
> repl_handler(): It is a timer handler which is re-programmed
> to fire at the nearest vcpu deadline to replenish vcpus.
>
> rt_schedule(): picks the highest runnable vcpu based on cpu
> affinity and ret.time will be passed to schedule(). If an idle
> vcpu is picked, -1 is returned to avoid busy-waiting. repl_update()
> has been removed.
>
> rt_vcpu_wake(): when a vcpu wakes up, it tickles instead of
> picking one from the run queue.
>
> rt_context_saved(): when context switching is finished, the
> preempted vcpu will be put back into the runq.
>
> Simplified functional graph:
>
> schedule.c SCHEDULE_SOFTIRQ:
>     rt_schedule():
>         [spin_lock]
>         burn_budget(scurr)
>         snext = runq_pick()
>         [spin_unlock]
>
> sched_rt.c TIMER_SOFTIRQ
>     replenishment_timer_handler()
>         [spin_lock]
>         <for_each_vcpu_on_q(i)> {
>             replenish(i)
>             runq_tickle(i)
>         }>
>         program_timer()
>         [spin_unlock]
>
> Signed-off-by: Tianyang Chen <tiche@seas.upenn.edu>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Dagaen Golomb <dgolomb@seas.upenn.edu>
> ---
> changes since v8:
>     Replaced spin_lock_irqsave with spin_lock_irq.
>     Bug fix in burn_budget() where budget == 0.
>     Removed unnecessary tickling in the timer handler when
>     vcpus have the same deadline.
>
> changes since v7:
>     Coding style.
>     Added a flag to indicate that a vcpu was depleted.
>     Added a local list including updated vcpus in the
>     timer handler. It is used to decide which vcpu should
>     tickle.
>
> changes since v6:
>     Removed unnecessary vcpu_on_q() checks for runq_insert()
>     Renamed replenishment queue related functions without
>     underscores
>     New replq_reinsert() function for rt_vcpu_wake()
>
> changes since v5:
>     Removed unnecessary vcpu_on_replq() checks
>     deadline_queue_insert() returns a flag to indicate if it's
>     needed to re-program the timer
>     Removed unnecessary timer checks
>     Added helper functions to remove vcpus from queues
>     coding style
>
> Changes since v4:
>     removed unnecessary replenishment queue checks in vcpu_wake()
>     extended replq_remove() to all cases in vcpu_sleep()
>     used _deadline_queue_insert() helper function for both queues
>     _replq_insert() and _replq_remove() program timer internally
>
> Changes since v3:
>     removed running queue.
>     added repl queue to keep track of repl events.
>     timer is now per scheduler.
>     timer is init on a valid cpu in a cpupool.
> ---
>  xen/common/sched_rt.c |  437 ++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 341 insertions(+), 96 deletions(-)
>
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> index bfed2e2..b491915 100644
> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -3,7 +3,9 @@
>   * EDF scheduling is a real-time scheduling algorithm used in embedded field.
>   *
>   * by Sisu Xi, 2013, Washington University in Saint Louis
> - * and Meng Xu, 2014, University of Pennsylvania
> + * Meng Xu, 2014-2016, University of Pennsylvania
> + * Tianyang Chen, 2016, University of Pennsylvania
> + * Dagaen Golomb, 2016, University of Pennsylvania
>   *
>   * based on the code of credit Scheduler
>   */
> @@ -16,6 +18,7 @@
>  #include <xen/delay.h>
>  #include <xen/event.h>
>  #include <xen/time.h>
> +#include <xen/timer.h>
>  #include <xen/perfc.h>
>  #include <xen/sched-if.h>
>  #include <xen/softirq.h>
> @@ -87,7 +90,7 @@
>  #define RTDS_DEFAULT_BUDGET     (MICROSECS(4000))
>
>  #define UPDATE_LIMIT_SHIFT      10
> -#define MAX_SCHEDULE            (MILLISECS(1))
> +
>  /*
>   * Flags
>   */
> @@ -115,6 +118,18 @@
>  #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
>
>  /*
> + * The replenishment timer handler needs to check this bit
> + * to know where a replenished vcpu was, when deciding which
> + * vcpu should tickle.
> + * A replenished vcpu should tickle if it was moved from the
> + * depleted queue to the run queue.
> + * + Set in burn_budget() if a vcpu has zero budget left.
> + * + Cleared and checked in the repenishment handler.

It seems you have an extra '+' here...
It needs to be removed.

My bad, I didn't spot it in the last patch... :-(

> + */
> +#define __RTDS_was_depleted     3
> +#define RTDS_was_depleted (1<<__RTDS_was_depleted)
> +
> +/*
>   * rt tracing events ("only" 512 available!). Check
>   * include/public/trace.h for more details.
>   */
> @@ -142,6 +157,9 @@ static cpumask_var_t *_cpumask_scratch;
>   */
>  static unsigned int nr_rt_ops;
>
> +/* handler for the replenishment timer */
> +static void repl_handler(void *data);
> +
>  /*
>   * Systme-wide private data, include global RunQueue/DepletedQ
>   * Global lock is referenced by schedule_data.schedule_lock from all
> @@ -152,7 +170,9 @@ struct rt_private {
>      struct list_head sdom;      /* list of availalbe domains, used for dump */
>      struct list_head runq;      /* ordered list of runnable vcpus */
>      struct list_head depletedq; /* unordered list of depleted vcpus */
> +    struct list_head replq;     /* ordered list of vcpus that need replenishment */
>      cpumask_t tickled;          /* cpus been tickled */
> +    struct timer *repl_timer;   /* replenishment timer */
>  };
>
>  /*
> @@ -160,6 +180,7 @@ struct rt_private {
>   */
>  struct rt_vcpu {
>      struct list_head q_elem;    /* on the runq/depletedq list */
> +    struct list_head replq_elem; /* on the repl event list */
>
>      /* Up-pointers */
>      struct rt_dom *sdom;
> @@ -213,8 +234,14 @@ static inline struct list_head *rt_depletedq(const struct scheduler *ops)
>      return &rt_priv(ops)->depletedq;
>  }
>
> +static inline struct list_head *rt_replq(const struct scheduler *ops)
> +{
> +    return &rt_priv(ops)->replq;
> +}
> +
>  /*
> - * Queue helper functions for runq and depletedq
> + * Helper functions for manipulating the runqueue, the depleted queue,
> + * and the replenishment events queue.
>   */
>  static int
>  __vcpu_on_q(const struct rt_vcpu *svc)
> @@ -228,6 +255,18 @@ __q_elem(struct list_head *elem)
>      return list_entry(elem, struct rt_vcpu, q_elem);
>  }
>
> +static struct rt_vcpu *
> +replq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, replq_elem);
> +}
> +
> +static int
> +vcpu_on_replq(const struct rt_vcpu *svc)
> +{
> +    return !list_empty(&svc->replq_elem);
> +}
> +
>  /*
>   * Debug related code, dump vcpu/cpu information
>   */
> @@ -288,7 +327,7 @@ rt_dump_pcpu(const struct scheduler *ops, int cpu)
>  static void
>  rt_dump(const struct scheduler *ops)
>  {
> -    struct list_head *runq, *depletedq, *iter;
> +    struct list_head *runq, *depletedq, *replq, *iter;
>      struct rt_private *prv = rt_priv(ops);
>      struct rt_vcpu *svc;
>      struct rt_dom *sdom;
> @@ -301,6 +340,7 @@ rt_dump(const struct scheduler *ops)
>
>      runq = rt_runq(ops);
>      depletedq = rt_depletedq(ops);
> +    replq = rt_replq(ops);
>
>      printk("Global RunQueue info:\n");
>      list_for_each( iter, runq )
> @@ -316,6 +356,13 @@ rt_dump(const struct scheduler *ops)
>          rt_dump_vcpu(ops, svc);
>      }
>
> +    printk("Global Replenishment Event info:\n");
> +    list_for_each ( iter, replq )
> +    {
> +        svc = replq_elem(iter);
> +        rt_dump_vcpu(ops, svc);
> +    }
> +
>      printk("Domain info:\n");
>      list_for_each( iter, &prv->sdom )
>      {
> @@ -380,11 +427,78 @@ rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
>      return;
>  }
>
> +/*
> + * A helper function that only removes a vcpu from a queue
> + * and it returns 1 if the vcpu was in front of the list.
> + */
> +static inline int
> +deadline_queue_remove(struct list_head *queue, struct list_head *elem)
> +{
> +    int pos = 0;
> +
> +    if ( queue->next != elem )
> +        pos = 1;
> +
> +    list_del_init(elem);
> +    return !pos;
> +}
> +
>  static inline void
>  __q_remove(struct rt_vcpu *svc)
>  {
> -    if ( __vcpu_on_q(svc) )
> -        list_del_init(&svc->q_elem);
> +    ASSERT( __vcpu_on_q(svc) );
> +    list_del_init(&svc->q_elem);
> +}
> +
> +/*
> + * Removing a vcpu from the replenishment queue could
> + * re-program the timer for the next replenishment event
> + * if it was at the front of the list.
> + */
> +static inline void
> +__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *replq = rt_replq(ops);
> +    struct timer* repl_timer = prv->repl_timer;
> +
> +    ASSERT( vcpu_on_replq(svc) );
> +
> +    if ( deadline_queue_remove(replq, &svc->replq_elem) )
> +    {
> +        /* re-arm the timer for the next replenishment event */
> +        if ( !list_empty(replq) )
> +        {
> +            struct rt_vcpu *svc_next = replq_elem(replq->next);
> +            set_timer(repl_timer, svc_next->cur_deadline);
> +        }
> +        else
> +            stop_timer(repl_timer);
> +    }
> +}
> +
> +/*
> + * An utility function that inserts a vcpu to a
> + * queue based on certain order (EDF). The returned
> + * value is 1 if a vcpu has been inserted to the
> + * front of a list.
> + */
> +static int
> +deadline_queue_insert(struct rt_vcpu * (*_get_q_elem)(struct list_head *elem),
> +    struct rt_vcpu *svc, struct list_head *elem, struct list_head *queue)
> +{
> +    struct list_head *iter;
> +    int pos = 0;
> +
> +    list_for_each ( iter, queue )
> +    {
> +        struct rt_vcpu * iter_svc = (*_get_q_elem)(iter);
> +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> +            break;
> +        pos++;
> +    }
> +    list_add_tail(elem, iter);
> +    return !pos;
>  }
>
>  /*
> @@ -397,27 +511,62 @@ __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>  {
>      struct rt_private *prv = rt_priv(ops);
>      struct list_head *runq = rt_runq(ops);
> -    struct list_head *iter;
>
>      ASSERT( spin_is_locked(&prv->lock) );
> -
>      ASSERT( !__vcpu_on_q(svc) );
> +    ASSERT( vcpu_on_replq(svc) );
>
>      /* add svc to runq if svc still has budget */
>      if ( svc->cur_budget > 0 )
> -    {
> -        list_for_each(iter, runq)
> -        {
> -            struct rt_vcpu * iter_svc = __q_elem(iter);
> -            if ( svc->cur_deadline <= iter_svc->cur_deadline )
> -                    break;
> -         }
> -        list_add_tail(&svc->q_elem, iter);
> -    }
> +        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
>      else
> -    {
>          list_add(&svc->q_elem, &prv->depletedq);
> +}
> +
> +/*
> + * Insert svc into the replenishment event list
> + * in replenishment time order.
> + * vcpus that need to be replished earlier go first.
> + * The timer may be re-programmed if svc is inserted
> + * at the front of the event list.
> + */
> +static void
> +__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *replq = rt_replq(ops);
> +    struct rt_private *prv = rt_priv(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +
> +    ASSERT( !vcpu_on_replq(svc) );
> +
> +    if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
> +        set_timer(repl_timer, svc->cur_deadline);
> +}
> +
> +/*
> + * Removes and re-inserts an event to the replenishment queue.
> + */
> +static void
> +replq_reinsert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *replq = rt_replq(ops);
> +    struct rt_private *prv = rt_priv(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +
> +    ASSERT( vcpu_on_replq(svc) );
> +
> +    if ( deadline_queue_remove(replq, &svc->replq_elem) )
> +    {
> +        if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
> +            set_timer(repl_timer, svc->cur_deadline);
> +        else
> +        {
> +            struct rt_vcpu *svc_next = replq_elem(replq->next);
> +            set_timer(repl_timer, svc_next->cur_deadline);
> +        }
>      }
> +    else if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
> +        set_timer(repl_timer, svc->cur_deadline);
>  }
>
>  /*
> @@ -449,11 +598,18 @@ rt_init(struct scheduler *ops)
>      INIT_LIST_HEAD(&prv->sdom);
>      INIT_LIST_HEAD(&prv->runq);
>      INIT_LIST_HEAD(&prv->depletedq);
> +    INIT_LIST_HEAD(&prv->replq);
>
>      cpumask_clear(&prv->tickled);
>
>      ops->sched_data = prv;
>
> +    /*
> +     * The timer initialization will happen later when
> +     * the first pcpu is added to this pool in alloc_pdata.
> +     */
> +    prv->repl_timer = NULL;
> +
>      return 0;
>
>   no_mem:
> @@ -473,6 +629,10 @@ rt_deinit(struct scheduler *ops)
>          xfree(_cpumask_scratch);
>          _cpumask_scratch = NULL;
>      }
> +
> +    kill_timer(prv->repl_timer);
> +    xfree(prv->repl_timer);
> +
>      ops->sched_data = NULL;
>      xfree(prv);
>  }
> @@ -494,6 +654,17 @@ rt_alloc_pdata(const struct scheduler *ops, int cpu)
>      if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
>          return NULL;
>
> +    if ( prv->repl_timer == NULL )
> +    {
> +        /* Allocate the timer on the first cpu of this pool. */
> +        prv->repl_timer = xzalloc(struct timer);
> +
> +        if ( prv->repl_timer == NULL )
> +            return NULL;
> +
> +        init_timer(prv->repl_timer, repl_handler, (void *)ops, cpu);
> +    }
> +
>      /* 1 indicates alloc. succeed in schedule.c */
>      return (void *)1;
>  }
> @@ -587,6 +758,7 @@ rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
>          return NULL;
>
>      INIT_LIST_HEAD(&svc->q_elem);
> +    INIT_LIST_HEAD(&svc->replq_elem);
>      svc->flags = 0U;
>      svc->sdom = dd;
>      svc->vcpu = vc;
> @@ -610,7 +782,8 @@ rt_free_vdata(const struct scheduler *ops, void *priv)
>  }
>
>  /*
> - * This function is called in sched_move_domain() in schedule.c
> + * It is called in sched_move_domain() and sched_init_vcpu
> + * in schedule.c.
>   * When move a domain to a new cpupool.
>   * It inserts vcpus of moving domain to the scheduler's RunQ in
>   * dest. cpupool.
> @@ -628,8 +801,13 @@ rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
>      if ( now >= svc->cur_deadline )
>          rt_update_deadline(now, svc);
>
> -    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) && !vc->is_running )
> -        __runq_insert(ops, svc);
> +    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) )
> +    {
> +        __replq_insert(ops, svc);
> +
> +        if ( !vc->is_running )
> +            __runq_insert(ops, svc);
> +    }
>      vcpu_schedule_unlock_irq(lock, vc);
>
>      SCHED_STAT_CRANK(vcpu_insert);
> @@ -652,6 +830,10 @@ rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
>      lock = vcpu_schedule_lock_irq(vc);
>      if ( __vcpu_on_q(svc) )
>          __q_remove(svc);
> +
> +    if ( vcpu_on_replq(svc) )
> +        __replq_remove(ops,svc);
> +
>      vcpu_schedule_unlock_irq(lock, vc);
>  }
>
> @@ -706,8 +888,15 @@ burn_budget(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
>
>      svc->cur_budget -= delta;
>
> -    if ( svc->cur_budget < 0 )
> +    if ( svc->cur_budget <= 0 )
> +    {
>          svc->cur_budget = 0;
> +        /*
> +         * Set __RTDS_was_depleted so the replenishment
> +         * handler can let this vcpu tickle if it was refilled.
> +         */
> +        set_bit(__RTDS_was_depleted, &svc->flags);
> +    }
>
>      /* TRACE */
>      {
> @@ -784,44 +973,6 @@ __runq_pick(const struct scheduler *ops, const cpumask_t *mask)
>  }
>
>  /*
> - * Update vcpu's budget and
> - * sort runq by insert the modifed vcpu back to runq
> - * lock is grabbed before calling this function
> - */
> -static void
> -__repl_update(const struct scheduler *ops, s_time_t now)
> -{
> -    struct list_head *runq = rt_runq(ops);
> -    struct list_head *depletedq = rt_depletedq(ops);
> -    struct list_head *iter;
> -    struct list_head *tmp;
> -    struct rt_vcpu *svc = NULL;
> -
> -    list_for_each_safe(iter, tmp, runq)
> -    {
> -        svc = __q_elem(iter);
> -        if ( now < svc->cur_deadline )
> -            break;
> -
> -        rt_update_deadline(now, svc);
> -        /* reinsert the vcpu if its deadline is updated */
> -        __q_remove(svc);
> -        __runq_insert(ops, svc);
> -    }
> -
> -    list_for_each_safe(iter, tmp, depletedq)
> -    {
> -        svc = __q_elem(iter);
> -        if ( now >= svc->cur_deadline )
> -        {
> -            rt_update_deadline(now, svc);
> -            __q_remove(svc); /* remove from depleted queue */
> -            __runq_insert(ops, svc); /* add to runq */
> -        }
> -    }
> -}
> -
> -/*
>   * schedule function for rt scheduler.
>   * The lock is already grabbed in schedule.c, no need to lock here
>   */
> @@ -840,8 +991,6 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>      /* burn_budget would return for IDLE VCPU */
>      burn_budget(ops, scurr, now);
>
> -    __repl_update(ops, now);
> -
>      if ( tasklet_work_scheduled )
>      {
>          trace_var(TRC_RTDS_SCHED_TASKLET, 1, 0,  NULL);
> @@ -868,6 +1017,7 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>          set_bit(__RTDS_delayed_runq_add, &scurr->flags);
>
>      snext->last_start = now;
> +    ret.time =  -1; /* if an idle vcpu is picked */
>      if ( !is_idle_vcpu(snext->vcpu) )
>      {
>          if ( snext != scurr )
> @@ -880,9 +1030,8 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>              snext->vcpu->processor = cpu;
>              ret.migrated = 1;
>          }
> +        ret.time = snext->budget; /* invoke the scheduler next time */

Ah, this is incorrect, although it is easy to fix.

ret.time is the relative time at which the *budget enforcement* timer
should be invoked.
Since snext does not always have a full budget, it may have already
consumed some of it.

It should be
ret.time = snext->cur_budget

Shouldn't it? :-)
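To make the difference concrete, here is a toy pair of helpers (hypothetical
names and example numbers, not code from the patch) showing what each choice
would return:

```c
#include <assert.h>

/* Correct enforcement delay: the time left before snext exhausts the
 * budget it has remaining, i.e. ret.time = snext->cur_budget. */
static long enforcement_delay(long cur_budget)
{
    return cur_budget;
}

/* If rt_schedule() returned the full budget instead, the vcpu would be
 * allowed to overrun by exactly what it has already consumed. */
static long overrun_if_full_budget(long full_budget, long cur_budget)
{
    return full_budget - cur_budget;
}
```

E.g. a vcpu with a 4000us budget that has already burned 1500us should trigger
enforcement after 2500us; programming the timer for the full 4000us would hand
it 1500us of extra runtime.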

The other parts of the scheduler look fine to me, as long as the
above comments are addressed.

Hi Tianyang,
Great and expedite work! :-D

Hi Dario,
Could you have another look at this patch and see if it's ok after the
above comments are solved?

Thank you very much!

Meng
Tianyang Chen March 16, 2016, 3:32 a.m. UTC | #2
On 03/15/2016 11:11 PM, Meng Xu wrote:
>> +
>>   /*
>>    * Flags
>>    */
>> @@ -115,6 +118,18 @@
>>   #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
>>
>>   /*
>> + * The replenishment timer handler needs to check this bit
>> + * to know where a replenished vcpu was, when deciding which
>> + * vcpu should tickle.
>> + * A replenished vcpu should tickle if it was moved from the
>> + * depleted queue to the run queue.
>> + * + Set in burn_budget() if a vcpu has zero budget left.
>> + * + Cleared and checked in the repenishment handler.
>
> It seems you have an extra + here...
> Need to be removed.
>
> My bad, I didn't spot it out in last patch... :-(
>

You mean before "Cleared"? For __RTDS_scheduled there are '+' before 
'Cleared', 'Checked', 'Set'.

>> @@ -840,8 +991,6 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>>       /* burn_budget would return for IDLE VCPU */
>>       burn_budget(ops, scurr, now);
>>
>> -    __repl_update(ops, now);
>> -
>>       if ( tasklet_work_scheduled )
>>       {
>>           trace_var(TRC_RTDS_SCHED_TASKLET, 1, 0,  NULL);
>> @@ -868,6 +1017,7 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>>           set_bit(__RTDS_delayed_runq_add, &scurr->flags);
>>
>>       snext->last_start = now;
>> +    ret.time =  -1; /* if an idle vcpu is picked */
>>       if ( !is_idle_vcpu(snext->vcpu) )
>>       {
>>           if ( snext != scurr )
>> @@ -880,9 +1030,8 @@ rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
>>               snext->vcpu->processor = cpu;
>>               ret.migrated = 1;
>>           }
>> +        ret.time = snext->budget; /* invoke the scheduler next time */
>
> Ah, this is incorrect, although this is easy to fix.
>
> The ret.time is the relative time when the *budget enforcement* timer
> should be invoked.
> Since snext is not always full budget, it may have already consumed some budget.
>
> It should be
> ret.time = snext->cur_budget
>
> Isn't it? :-)
Good catch. This bug kinda ruins the fix for busy waiting.

Thanks,
Tianyang
Meng Xu March 16, 2016, 3:40 a.m. UTC | #3
>>> @@ -115,6 +118,18 @@
>>>   #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
>>>
>>>   /*
>>> + * The replenishment timer handler needs to check this bit
>>> + * to know where a replenished vcpu was, when deciding which
>>> + * vcpu should tickle.
>>> + * A replenished vcpu should tickle if it was moved from the
>>> + * depleted queue to the run queue.
>>> + * + Set in burn_budget() if a vcpu has zero budget left.
>>> + * + Cleared and checked in the repenishment handler.
>>
>>
>> It seems you have an extra + here...
>> Need to be removed.
>>
>> My bad, I didn't spot it out in last patch... :-(
>>
>
> You mean before "Cleared"? For __RTDS_scheduled there are '+' before
> 'Cleared', 'Checked', 'Set'.

Yes, those two '+'s are unnecessary, aren't they?

>
>>> @@ -840,8 +991,6 @@ rt_schedule(const struct scheduler *ops, s_time_t
>>> now, bool_t tasklet_work_sched
>>>       /* burn_budget would return for IDLE VCPU */
>>>       burn_budget(ops, scurr, now);
>>>
>>> -    __repl_update(ops, now);
>>> -
>>>       if ( tasklet_work_scheduled )
>>>       {
>>>           trace_var(TRC_RTDS_SCHED_TASKLET, 1, 0,  NULL);
>>> @@ -868,6 +1017,7 @@ rt_schedule(const struct scheduler *ops, s_time_t
>>> now, bool_t tasklet_work_sched
>>>           set_bit(__RTDS_delayed_runq_add, &scurr->flags);
>>>
>>>       snext->last_start = now;
>>> +    ret.time =  -1; /* if an idle vcpu is picked */
>>>       if ( !is_idle_vcpu(snext->vcpu) )
>>>       {
>>>           if ( snext != scurr )
>>> @@ -880,9 +1030,8 @@ rt_schedule(const struct scheduler *ops, s_time_t
>>> now, bool_t tasklet_work_sched
>>>               snext->vcpu->processor = cpu;
>>>               ret.migrated = 1;
>>>           }
>>> +        ret.time = snext->budget; /* invoke the scheduler next time */
>>
>>
>> Ah, this is incorrect, although this is easy to fix.
>>
>> The ret.time is the relative time when the *budget enforcement* timer
>> should be invoked.
>> Since snext is not always full budget, it may have already consumed some
>> budget.
>>
>> It should be
>> ret.time = snext->cur_budget
>>
>> Isn't it? :-)
>
> Good catch. This bug kinda ruins the fix for busy waiting.

Right! This just shows why we may need some semi-automated testing
tools to check logic correctness. ;-)

Meng
Dario Faggioli March 16, 2016, 10:23 a.m. UTC | #4
On Tue, 2016-03-15 at 23:40 -0400, Meng Xu wrote:
> > > > @@ -115,6 +118,18 @@
> > > >   #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
> > > > 
> > > >   /*
> > > > + * The replenishment timer handler needs to check this bit
> > > > + * to know where a replenished vcpu was, when deciding which
> > > > + * vcpu should tickle.
> > > > + * A replenished vcpu should tickle if it was moved from the
> > > > + * depleted queue to the run queue.
> > > > + * + Set in burn_budget() if a vcpu has zero budget left.
> > > > + * + Cleared and checked in the repenishment handler.
> > > 
> > > It seems you have an extra + here...
> > > Need to be removed.
> > > 
> > > My bad, I didn't spot it out in last patch... :-(
> > > 
> > You mean before "Cleared"? For __RTDS_scheduled there are '+'
> > before
> > 'Cleared', 'Checked', 'Set'.
> Yes, those two +, are unnecessary. Isn't it?
> 
I *think* the idea here was to sort of put down a bulleted list, but
maybe we should ask the author. According to `git blame', it's a certain
Meng Xu guy (commit 8726c055); does anyone have his email address? :-D :-D

However, I don't particularly like either the style or the final result
(in terms of wording), so, let's avoid doing more of that in new code
(see my other email).

Regards,
Dario
Meng Xu March 16, 2016, 2:20 p.m. UTC | #5
On Wed, Mar 16, 2016 at 6:23 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Tue, 2016-03-15 at 23:40 -0400, Meng Xu wrote:
>> > > > @@ -115,6 +118,18 @@
>> > > >   #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
>> > > >
>> > > >   /*
>> > > > + * The replenishment timer handler needs to check this bit
>> > > > + * to know where a replenished vcpu was, when deciding which
>> > > > + * vcpu should tickle.
>> > > > + * A replenished vcpu should tickle if it was moved from the
>> > > > + * depleted queue to the run queue.
>> > > > + * + Set in burn_budget() if a vcpu has zero budget left.
>> > > > + * + Cleared and checked in the repenishment handler.
>> > >
>> > > It seems you have an extra + here...
>> > > Need to be removed.
>> > >
>> > > My bad, I didn't spot it out in last patch... :-(
>> > >
>> > You mean before "Cleared"? For __RTDS_scheduled there are '+'
>> > before
>> > 'Cleared', 'Checked', 'Set'.
>> Yes, those two +, are unnecessary. Isn't it?
>>
> I *think* the idea here was to sort of put down a bullet-ed list, but
> maybe we should ask the author. According to `git blame', is a certain
> Meng Xu, guy (commit 8726c055), anyone has his email address? :-D :-D

Ah-ha, let me try to find this Meng Xu. ;-)
Here we go. I cc'ed him... ;-)

>
> However, I don't particularly like either the style or the final result
> (in terms of wording), so, let's avoid doing more of that in new code
> (see my other email).

I checked sched_credit.c and sched_credit2.c, and it seems that either
'+' or '-' is used for lists. When there are two levels of lists, both
are used.
However, it's inconsistent which one is used at the top level.
Probably, to keep consistency in this file, we should keep using '+',
and later, if we ever want to clean up this style issue, we can
replace them.

As to the comment, I will suggest:

/*
 * RTDS_was_depleted: Is a vcpu's budget depleted?
 * + Set in burn_budget() when a vcpu's budget turns to zero.
 * + Checked and cleared in repl_handler() to replenish the budget.
 */

What do you think?

BTW, how about other parts of the patch? Is there something that you don't like?
I think the invalid budget returned in rt_schedule() in this patch is
a serious logical bug.
If all of my comments are addressed, I think it is in a good state and
I'm considering sending my Reviewed-by tag in the near future.
However, I won't send the Reviewed-by until I get your confirmation,
since this will be the first Reviewed-by I give as a maintainer. I'd
like to take a safe step. :-D

Thanks and Best Regards,

Meng
Dario Faggioli March 16, 2016, 2:25 p.m. UTC | #6
Ok, it's about time that we deal with this changelog!

On Mon, 2016-03-14 at 20:04 -0400, Tianyang Chen wrote:
> Budget replenishment and enforcement are separated by adding
> a replenishment timer, which fires at the next most imminent
> release time of all runnable vcpus.
> 
First of all, state (quickly) the "problems". So, right now:
 - the scheduler, although the algorithm is event driven by nature,
   follows a time-driven model (it is invoked periodically!), making
   the code look unnatural;
 - budget replenishment logic, budget enforcement logic and scheduling
   decisions are mixed and entangled, making the code hard to
   understand;
 - the various queues of vcpus are scanned multiple times, making the
   code inefficient;

Then describe your solution. The first sentence you've got up above is
ok...

> A replenishment queue has been added to keep track of all vcpus that
> are runnable.
> 
...and this one too.

> The following functions have major changes to manage the
> replenishment
> events and timer.
> 
> repl_handler(): It is a timer handler which is re-programmed
> to fire at the nearest vcpu deadline to replenish vcpus.
> 
> rt_schedule(): picks the highest runnable vcpu based on cpu
> affinity and ret.time will be passed to schedule(). If an idle
> vcpu is picked, -1 is returned to avoid busy-waiting. repl_update()
> has been removed.
> 
> rt_vcpu_wake(): when a vcpu wakes up, it tickles instead of
> picking one from the run queue.
> 
> rt_context_saved(): when context switching is finished, the
> preempted vcpu will be put back into the runq.
> 
This is too detailed for a changelog. If you want this information to
live somewhere (it already lives in the list archives, actually), make
a cover letter (this is just one patch, so it's not required, but
nothing forbids that). Or put it in a wiki page. Or write a blog post.
Or (which would be kind of nice) all of them! :-)

Just a few more words, in addition of the above two sentences, covering
the other changes done in the patch are more than enough. Such as:

"In the main scheduling function, we now return the next time when a
making a scheduling decision is going to be necessary, such as when the
budget of the currently running vcpu will run over.

Finally, when waking up a vcpu, it is now enough to tickle the various
CPUs appropriately, like all other schedulers also do."
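To make the event-driven core of the change concrete, here is a toy, user-space sketch (every name below is made up for the example, none of it is from sched_rt.c) of the timer-programming idea: the one-shot replenishment timer is set to the earliest pending replenishment time, or kept stopped when nothing is pending, instead of unconditionally firing every MAX_SCHEDULE:

```c
#include <assert.h>

/*
 * Toy model: given the (unsorted) replenishment times of the runnable
 * vcpus, compute when the one-shot timer must fire next.
 * Returns -1 when there is nothing to replenish, i.e., the timer is
 * kept stopped rather than firing periodically.
 */
static long next_timer_fire(const long *repl_times, int n)
{
    long earliest = -1;
    int i;

    for ( i = 0; i < n; i++ )
        if ( earliest < 0 || repl_times[i] < earliest )
            earliest = repl_times[i];

    return earliest;
}
```

The real code gets the same effect more cheaply, of course, by keeping the replenishment queue sorted and just reading the front element.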

> Simplified funtional graph:
> 
> schedule.c SCHEDULE_SOFTIRQ:
>     rt_schedule():
>         [spin_lock]
>         burn_budget(scurr)
>         snext = runq_pick()
>         [spin_unlock]
> 
> sched_rt.c TIMER_SOFTIRQ
>     replenishment_timer_handler()
>         [spin_lock]
>         <for_each_vcpu_on_q(i)> {
>             replenish(i)
>             runq_tickle(i)
>         }>
>         program_timer()
>         [spin_lock]
> 
And kill this (or, again, move to cover, wiki, blog, etc.) as well.

Also, trying to apply this gave:

/* 
<stdin>:194: trailing whitespace.
 * A helper function that only removes a vcpu from a queue 
<stdin>:355: trailing whitespace.
    /* 
<stdin>:356: trailing whitespace.
     * The timer initialization will happen later when 
<stdin>:380: trailing whitespace.
    {   
warning: squelched 10 whitespace errors
warning: 15 lines add whitespace errors.

And I think there are a couple of long lines as well.

> --- a/xen/common/sched_rt.c
> +++ b/xen/common/sched_rt.c
> @@ -3,7 +3,9 @@
>   * EDF scheduling is a real-time scheduling algorithm used in
> embedded field.
>   *
>   * by Sisu Xi, 2013, Washington University in Saint Louis
> - * and Meng Xu, 2014, University of Pennsylvania
> + * Meng Xu, 2014-2016, University of Pennsylvania
> + * Tianyang Chen, 2016, University of Pennsylvania
> + * Dagaen Golomb, 2016, University of Pennsylvania
>   *
 * Meng Xu, 2014-2016, University of Pennsylvania.
 *
 * Conversion toward an event-driven model by Tianyang Chen
 * and Dagaen Golomb, 2016, University of Pennsylvania.

> @@ -115,6 +118,18 @@
>  #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
>  
>  /*
> + * The replenishment timer handler needs to check this bit
> + * to know where a replenished vcpu was, when deciding which
> + * vcpu should tickle.
> + * A replenished vcpu should tickle if it was moved from the
> + * depleted queue to the run queue.
> + * + Set in burn_budget() if a vcpu has zero budget left.
> + * + Cleared and checked in the repenishment handler.
> + */
>
Stuff about how the replenishment timer handler uses it, with this
level of details, is better said in the replenishment timer handler
itself. Such that, if at some point we'll use this flag for other
things, we will not have to change this comment!

Just limit it to what the flag is meant to represent. So, to sort of
follow the style of other similar comments (yes, perhaps it's not too
bad :-D).

And I'd lose the _was in the name.

/*
 * RTDS_depleted: does this vcpu run out of budget?
 * This flag is:
 * + set in burn_budget(), if a vcpu has zero budget left;
 * + checked and cleared in the replenishment timer handler,
 *   for the vcpus that are being replenished.
 */


> +#define __RTDS_was_depleted     3
> +#define RTDS_was_depleted (1<<__RTDS_was_depleted)
>
__RTDS_depleted and RTDS_depleted.

It's equally expressive and more general (i.e., suitable for broader
usage in future, without needing to change the name at that time).
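For illustration, the intended life cycle of the flag, as discussed above, can be sketched in a compilable toy (made-up types, not the actual Xen code): burn_budget() raises it when the budget is exhausted (note the <= 0 comparison, which is the v9 fix for the budget == 0 case), and the replenishment handler later tests-and-clears it:

```c
#include <assert.h>

typedef long s_time_t;

#define __RTDS_depleted 3
#define RTDS_depleted   (1UL << __RTDS_depleted)

struct toy_vcpu {
    s_time_t cur_budget;
    unsigned long flags;
};

/*
 * Sketch of the budget-enforcement step: burn 'delta' of budget and
 * raise the depleted flag once the budget runs out (clamped at zero).
 */
static void toy_burn_budget(struct toy_vcpu *svc, s_time_t delta)
{
    svc->cur_budget -= delta;
    if ( svc->cur_budget <= 0 )
    {
        svc->cur_budget = 0;
        svc->flags |= RTDS_depleted;
    }
}
```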

> @@ -142,6 +157,9 @@ static cpumask_var_t *_cpumask_scratch;
>   */
>  static unsigned int nr_rt_ops;
>  
> +/* handler for the replenishment timer */
> +static void repl_handler(void *data);
> +
Can we call the function repl_timer_handler() and lose the comment?

> @@ -160,6 +180,7 @@ struct rt_private {
>   */
>  struct rt_vcpu {
>      struct list_head q_elem;    /* on the runq/depletedq list */
> +    struct list_head replq_elem; /* on the repl event list */
>  
"on the replenishment events list"

It still fits, and it's much better.

> @@ -316,6 +356,13 @@ rt_dump(const struct scheduler *ops)
>          rt_dump_vcpu(ops, svc);
>      }
>  
> +    printk("Global Replenishment Event info:\n");
>
"Events"

> @@ -380,11 +427,78 @@ rt_update_deadline(s_time_t now, struct rt_vcpu
> *svc)
>      return;
>  }
>  
> +/* 
> + * A helper function that only removes a vcpu from a queue 
> + * and it returns 1 if the vcpu was in front of the list.
> + */
> +static inline int
> +deadline_queue_remove(struct list_head *queue, struct list_head
> *elem)
> +{
> +    int pos = 0;
> +
> +    if ( queue->next != elem )
> +        pos = 1;
> +
> +    list_del_init(elem);
> +    return !pos;
> +}
> +
Can we move deadline_queue_insert() here, so that these two similar
helpers are close to each other, and harmonize, or even unify, the
comments?

Something like (put above deadline_queue_remove()):
/*
 * Helpers for removing and inserting a vcpu in a queue
 * that is being kept ordered by the vcpus' deadlines (as EDF
 * mandates).
 *
 * For callers' convenience, the vcpu removing helper returns
 * true if the vcpu removed was the one at the front of the
 * queue; similarly, the inserting helper returns true if the
 * inserted vcpu ended up at the front of the queue (i.e., in both
 * cases, if the vcpu with the earliest deadline is what we
 * are dealing with).
 */

Also, it looks like both of them can be of boolean return type.
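For what it's worth, here is a compilable toy version of the pair of helpers, with the boolean return type and the front-of-queue semantics described in the suggested comment. The list primitives are minimal stand-ins written for this example, not Xen's list.h, and the vcpu structure is reduced to the bare minimum:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for Xen's circular doubly-linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    list_init(e);
}

/* Insert 'e' right before 'pos'. */
static void list_add_before(struct list_head *e, struct list_head *pos)
{
    e->prev = pos->prev;
    e->next = pos;
    pos->prev->next = e;
    pos->prev = e;
}

struct toy_vcpu { long cur_deadline; struct list_head replq_elem; };

#define elem_to_vcpu(e) \
    ((struct toy_vcpu *)((char *)(e) - offsetof(struct toy_vcpu, replq_elem)))

/* Returns true iff the removed element was at the front of the queue. */
static bool deadline_queue_remove(struct list_head *queue,
                                  struct list_head *elem)
{
    bool first = (queue->next == elem);

    list_del_init(elem);
    return first;
}

/* Keeps the queue in EDF (increasing deadline) order; returns true
 * iff the inserted vcpu ended up at the front of the queue. */
static bool deadline_queue_insert(struct toy_vcpu *svc,
                                  struct list_head *elem,
                                  struct list_head *queue)
{
    struct list_head *iter;

    for ( iter = queue->next; iter != queue; iter = iter->next )
        if ( svc->cur_deadline < elem_to_vcpu(iter)->cur_deadline )
            break;

    list_add_before(elem, iter);
    return queue->next == elem;
}
```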

> +/*
> + * Removing a vcpu from the replenishment queue could
> + * re-program the timer for the next replenishment event
> + * if it was at the front of the list.
> + */
>
This makes me think that it is the removal itself that,
automagically, would reprogram the timer.

Kill this comment and expand the one inside the function.

> +static inline void
> +__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
>
Can we call it replq_remove()?

> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *replq = rt_replq(ops);
> +    struct timer* repl_timer = prv->repl_timer;
>
There's no need for this local variable (repl_timer).

> +
> +    ASSERT( vcpu_on_replq(svc) );
> +
> +    if ( deadline_queue_remove(replq, &svc->replq_elem) )
> +    {
> +        /* re-arm the timer for the next replenishment event */
/*
 * The replenishment timer needs to be set to fire when a
 * replenishment for the vcpu at the front of the replenishment
 * queue is due. If it is such vcpu that we just removed, we may
 * need to reprogram the timer.
 */

> +        if ( !list_empty(replq) )
> +        {
> +            struct rt_vcpu *svc_next = replq_elem(replq->next);
> +            set_timer(repl_timer, svc_next->cur_deadline);
> +        }
> +        else
> +            stop_timer(repl_timer);
> +    }
> +}
> +
> +/*
> + * An utility function that inserts a vcpu to a
> + * queue based on certain order (EDF). The returned
> + * value is 1 if a vcpu has been inserted to the
> + * front of a list.
> + */
> +static int
> +deadline_queue_insert(struct rt_vcpu * (*_get_q_elem)(struct
> list_head *elem),
> +    struct rt_vcpu *svc, struct list_head *elem, struct list_head
> *queue)
> +{
>
Indentation.

Also I'd use some #defines to make things look a bit better.

So:

static bool_t
deadline_queue_insert(struct rt_vcpu * (*qelem)(struct list_head *),
                      struct rt_vcpu *svc, struct list_head *elem,
                      struct list_head *queue)
{
 ...
}
#define deadline_runq_insert(...) \
  deadline_queue_insert(&__q_elem, ##__VA_ARGS__)
#define deadline_replq_insert(...) \
  deadline_queue_insert(&replq_elem, ##__VA_ARGS__)

And then, at the various call sites:

 deadline_runq_insert(svc, &svc->q_elem, runq);
 ...
 if ( deadline_replq_insert(svc, &svc->replq_elem, replq) )
 ...
 if ( deadline_replq_insert(svc, &svc->replq_elem, replq) )
 ...
 etc.

Which would also make a few long lines go away.
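The binding trick itself is easy to demo in isolation. In the sketch below, generic_op() and its callbacks are invented for the example; only the `, ##__VA_ARGS__` forwarding pattern is the point (it is a GNU extension, which is fine for code built with gcc):

```c
#include <assert.h>

/* A "generic" operation taking a callback as its first argument... */
static int generic_op(int (*xform)(int), int a, int b)
{
    return xform(a) + xform(b);
}

static int ident(int x) { return x; }
static int twice(int x) { return 2 * x; }

/*
 * ...and two wrappers that pre-bind the callback, so call sites stay
 * short, exactly like the suggested deadline_runq_insert() /
 * deadline_replq_insert() pair.
 */
#define ident_op(...) generic_op(&ident, ##__VA_ARGS__)
#define twice_op(...) generic_op(&twice, ##__VA_ARGS__)
```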

> @@ -397,27 +511,62 @@ __runq_insert(const struct scheduler *ops,
> struct rt_vcpu *svc)

> +/*
> + * Insert svc into the replenishment event list
> + * in replenishment time order.
> + * vcpus that need to be replished earlier go first.
> + * The timer may be re-programmed if svc is inserted
> + * at the front of the event list.
> + */
>
You can keep the lines a bit longer than this, the comment will look
better. No need to necessarily reach 79 characters, of course, but
reaching ~70 would be ok.

And, actually, even if I feel like I remember I suggested this comment
myself, I think it's best to do what I said already for replq_remove():
keep only the last sentence, and move it inside the function (just
above the if is fine).

> +static void
> +__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
>
Call this replq_insert().

> +    struct list_head *replq = rt_replq(ops);
> +    struct rt_private *prv = rt_priv(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +
Again, no need for this.

> +/*
> + * Removes and re-inserts an event to the replenishment queue.
>
    * The aim is to update its position inside the queue, as its 
    * deadline (and hence its replenishment time) could have
    * changed.

> + */
> +static void
> +replq_reinsert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *replq = rt_replq(ops);
> +    struct rt_private *prv = rt_priv(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +
> +    ASSERT( vcpu_on_replq(svc) );
> +
> +    if ( deadline_queue_remove(replq, &svc->replq_elem) )
> +    {
> +        if ( deadline_queue_insert(&replq_elem, svc, &svc-
> >replq_elem, replq) )
> +            set_timer(repl_timer, svc->cur_deadline);
> +        else
> +        {
> +            struct rt_vcpu *svc_next = replq_elem(replq->next);
> +            set_timer(repl_timer, svc_next->cur_deadline);
> +        }
>      }
> +    else if ( deadline_queue_insert(&replq_elem, svc, &svc-
> >replq_elem, replq) )
> +        set_timer(repl_timer, svc->cur_deadline);
>
This looks better, and it should work (but please double check, haven't
tested it):

static void
replq_reinsert(const struct scheduler *ops, struct rt_vcpu *svc)
{
    struct list_head *replq = rt_replq(ops);
    struct rt_vcpu *rearm_svc = svc;
    bool_t rearm = false;

    ASSERT( vcpu_on_replq(svc) );

    /*
     * If svc was at the front of the replenishment queue, we certainly
     * need to re-program the timer, and we want to use the deadline of
     * the vcpu which is now at the front of the queue (which may still
     * be svc or not).
     * 
     * We may also need to re-program, if svc has been put at the front
     * of the replenishment queue when being re-inserted.
     */
    if ( deadline_queue_remove(replq, &svc->replq_elem) )
    {
        deadline_replq_insert(svc, &svc->replq_elem, replq);
        rearm_svc = replq_elem(replq->next);
        rearm = true; 
    }
    else
        rearm = deadline_replq_insert(svc, &svc->replq_elem, replq);

    if ( rearm )
        set_timer(rt_priv(ops)->repl_timer, rearm_svc->cur_deadline);
}
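The re-arm rule in the sketch above can also be checked in isolation. Below is a toy model, written just for this example, that represents the replenishment queue as a sorted array of deadlines and answers "to which value, if any, must the timer be reprogrammed after a remove + reinsert?":

```c
#include <assert.h>

/*
 * Toy model of replq_reinsert(): 'q' holds the deadlines of the
 * replenishment queue, sorted ascending; the vcpu at position 'pos'
 * gets the new deadline 'nd' and is re-inserted in order.
 * Returns the value the timer must be re-programmed to, or -1 if no
 * re-programming is needed.
 */
static long rearm_after_reinsert(long *q, int n, int pos, long nd)
{
    int i;

    /* Remove the element at 'pos'. */
    for ( i = pos; i < n - 1; i++ )
        q[i] = q[i + 1];
    n--;

    /* Re-insert 'nd', keeping the array sorted. */
    for ( i = n; i > 0 && q[i - 1] > nd; i-- )
        q[i] = q[i - 1];
    q[i] = nd;

    /* Removed from the front: always re-arm, to whatever is now first. */
    if ( pos == 0 )
        return q[0];

    /* Otherwise, re-arm only if the re-inserted vcpu is the new front. */
    return (i == 0) ? nd : -1;
}
```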

> @@ -706,8 +888,15 @@ burn_budget(const struct scheduler *ops, struct
> rt_vcpu *svc, s_time_t now)
>  
>      svc->cur_budget -= delta;
>  
> -    if ( svc->cur_budget < 0 )
> +    if ( svc->cur_budget <= 0 )
> +    {    
>          svc->cur_budget = 0;
> +        /* 
> +         * Set __RTDS_was_depleted so the replenishment
> +         * handler can let this vcpu tickle if it was refilled.
> +         */
> +        set_bit(__RTDS_was_depleted, &svc->flags);
>
Kill the comment. You'll say what happens there... in there! :-)

> @@ -1150,6 +1301,100 @@ rt_dom_cntl(
>      return rc;
>  }
>  
> +/*
> + * The replenishment timer handler picks vcpus
> + * from the replq and does the actual replenishment.
> + */
> +static void repl_handler(void *data){
> +    s_time_t now = NOW();
> +    struct scheduler *ops = data; 
> +    struct rt_private *prv = rt_priv(ops);
> +    struct list_head *replq = rt_replq(ops);
> +    struct list_head *runq = rt_runq(ops);
> +    struct timer *repl_timer = prv->repl_timer;
> +    struct list_head *iter, *tmp;
> +    struct list_head tmp_replq;
> +    struct rt_vcpu *svc = NULL;
>
You don't need to initialize this, do you?

> +    spin_lock_irq(&prv->lock);
> +
> +    stop_timer(repl_timer);
> +
There is no need for stopping the timer inside the handler.

> +    /*
> +     * A temperary queue for updated vcpus.
> +     * It is used to tickle.
> +     */
> +    INIT_LIST_HEAD(&tmp_replq);
> +
You can use LIST_HEAD(tmp_replq); for defining and initializing. I'm
quite sure this is what I suggested; are you using this because you
tried it and it didn't work?

The comment is not useful. It's quite clear from a number of factors
that it's a temporary list.
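Just to show why the one-liner works, here is a hypothetical re-implementation of the macros involved (the real ones live in xen/include/xen/list.h): LIST_HEAD() both declares and initializes the head in a single statement, making the separate INIT_LIST_HEAD() call unnecessary.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Declares *and* initializes a list head in one statement. */
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* The two-step alternative, for comparison. */
#define INIT_LIST_HEAD(h) \
    do { (h)->next = (h); (h)->prev = (h); } while (0)
```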

> +    list_for_each_safe ( iter, tmp, replq )
> +    {
> +        svc = replq_elem(iter);
> +
> +        if ( now < svc->cur_deadline )
> +            break;
> +
> +        rt_update_deadline(now, svc);
> +        /*
> +         * When a vcpu is replenished, it is moved
> +         * to a temporary queue to tickle.
> +         */
> +        list_del(&svc->replq_elem);
> +        list_add(&svc->replq_elem, &tmp_replq);
> +
           list_del(&svc->replq_elem);
           rt_update_deadline(now, svc);
           list_add(&svc->replq_elem, &tmp_replq);

Is clearer, IMO. Also, the comments: I think they need to be clearer
about what is happening in these two loops.

What I think is best is to get rid of all these pieces inside the
list_for_each(), and put just one comment above it, explaining what
happens inside the loop (the same is true for the other
list_for_each() below).

> +        /*
> +         * If svc is on run queue, we need
> +         * to put it to the correct place since
> +         * its deadline changes.
> +         */
> +        if ( __vcpu_on_q(svc) )
> +        {
> +            /* put back to runq */
> +            __q_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +
> +    /* Iterate through the list of updated vcpus. */
> +    list_for_each_safe ( iter, tmp, &tmp_replq )
> +    {
> +        struct vcpu* vc;
> +        svc = replq_elem(iter);
> +        vc = svc->vcpu;
>
For vcpus, we should use variables named 'v'. But, anyway, I don't
think you need this (it's used only once!).

> +        if ( curr_on_cpu(vc->processor) == vc && 
> +             ( !list_empty(runq) ) )
>
So, this is because, since we don't keep the idle vcpus in the
runqueues, we need to catch the case where v is running, but no other
vcpu is waiting on the runqueue in runnable state, is this right?

In any case, parentheses around the second part of the && seems
pointless.

> +        {
> +            /*
> +             * If the vcpu is running, let the head
> +             * of the run queue tickle if it has a 
> +             * higher priority.
> +             */
> +            struct rt_vcpu *next_on_runq = __q_elem(runq->next);
>
Empty line here.

Also, the correct wording is "tickle the head of the runqueue". (But
see what I said above about comments for these loops.)

> +            if ( svc->cur_deadline > next_on_runq->cur_deadline )
> +                runq_tickle(ops, next_on_runq);
> +        }
> +        else if ( __vcpu_on_q(svc) )
> +        {
> +            /* Only need to tickle a vcpu that was depleted. */
> +            if ( test_and_clear_bit(__RTDS_was_depleted, &svc-
> >flags) )
>
These two if-s can be combined, can't they?
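They can, thanks to && short-circuiting: the flag is only consumed when the vcpu is actually on the runqueue, which is exactly what the nested if-s guarantee. A self-contained toy of that equivalence (the helpers below are simplified, non-atomic stand-ins invented for the example, not Xen's bitops):

```c
#include <assert.h>
#include <stdbool.h>

#define __RTDS_was_depleted 3

/* Simplified, non-atomic stand-in for Xen's test_and_clear_bit(). */
static bool test_and_clear_bit(int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    bool was_set = (*addr & mask) != 0;

    *addr &= ~mask;
    return was_set;
}

/*
 * The two if-s combined: short-circuit evaluation leaves the flag
 * untouched when the vcpu is not on the runqueue, so this behaves
 * exactly like the nested version.
 */
static bool should_tickle(bool vcpu_on_q, unsigned long *flags)
{
    return vcpu_on_q && test_and_clear_bit(__RTDS_was_depleted, flags);
}
```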


> +                runq_tickle(ops, svc);
> +        }
> +
> +        list_del(&svc->replq_elem);
> +        /* Move it back to the replenishment event queue. */
> +        deadline_queue_insert(&replq_elem, svc, &svc->replq_elem,
> replq);
> +    }
> +
> +    /* 
> +     * Use the vcpu that's on the top of the run queue.
>
It's not the run queue, it's the replenishment queue.

> +     * Otherwise don't program the timer.
> +     */
>
"Otherwise" what?

Maybe:

/*
 * If there are vcpus left in the replenishment event list,
 * set the next replenishment to happen at the deadline of
 * the one at the front.
 */

> +    if ( !list_empty(replq) )
> +        set_timer(repl_timer, replq_elem(replq->next)-
> >cur_deadline);
> +
> +    spin_unlock_irq(&prv->lock);
> +}
> +

Thanks and Regads,
Dario
Dario Faggioli March 16, 2016, 2:44 p.m. UTC | #7
On Wed, 2016-03-16 at 10:20 -0400, Meng Xu wrote:
> As to the comment, I will suggest:
> 
> /*
>  * RTDS_was_depleted: Is a vcpus budget depleted?
> 
>  * + Set in burn_budget() when a vcpus budget turns to zero
> 
>  * + Checked and cleared in repl_handler() to replenish the budget
> 
>  */
> 
> What do you think?
> 
Wow, we really are on the same page, I just suggested something quite
similar to this.

> BTW, how about other parts of the patch? Is there something that you
> don't like?
>
I just sent in my comments. I did have a few of them, I'm afraid.

Still, there wasn't anything really substantial, from the logical or
algorithmic point of view. They almost all are about how to make both
the patch and the final result as easy to understand as possible.

This patch is doing some important and serious rework, of a piece of
code which is not in the best possible shape, so I consider the above
really really important.

> I think the invalid budget returned in rt_schedule() in this patch is
> a serious logical bug.
>
It's a bug. And you did a good job catching it. That's it. :-)

> If all of my comments are solved, I think it is in a good state and
> I'm considering to send the reviewed-by tag in the near future.
>
Sure!

I don't expect it to take much longer for me to also be able to give
green light. :-)

> However, I won't send the reviewed-by until I get your confirmation,
> since this will be the first reviewed-by I will be giving as
> maintainer. I'd like to take a safe step. :-D
> 
Mmm... That's not how it works, though. I know the feeling, and we've
all been there, I guess. However, you see and are saying yourself
already that such an attitude can't really fly, in the long run.

Do as you wish, but you must know that you lost the right of "taking
safe steps" when you've become part of the MAINTAINERS file. :-P

So, jokes aside, it's perfectly fine that, as soon as you're happy
about the status of the patch, you send in your tag. In this specific
case, even if you are a maintainer, that would not mean the patch could
go in, because there are outstanding comments from another maintainer. 

And yes, this means that co-maintainers are not entirely agreeing on
the readiness of a particular patch, but it's not at all a big deal...
It happens quite frequently, actually! :-)

So, again, I understand the feeling... But it really reduces to "with
great powers comes great responsibility". And just like in the movie
(which I don't even like that much!! :-/), the sooner you accept that,
the better. For you. For us. For the project. For the World. :-D :-D

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Tianyang Chen March 16, 2016, 3:43 p.m. UTC | #8
On 03/16/2016 10:25 AM, Dario Faggioli wrote:
> Ok, it's about time that we deal with this changelog!
>
> On Mon, 2016-03-14 at 20:04 -0400, Tianyang Chen wrote:
>> Budget replenishment and enforcement are separated by adding
>> a replenishment timer, which fires at the next most imminent
>> release time of all runnable vcpus.
>>
> First of all, state (quickly) the "problems". So, right now:
>   - the scheduler, although the algorithm is event driven by nature,
>     follow a time driven model (is invoked periodically!), making the
>     code looks unnatural;
>   - budget replenishment logic, budget enforcement logic and scheduling
>     decisions are mixed and entangled, making the code hard to
>     understand;
>   - the various queues of vcpus are scanned various times, making the
>     code inefficient;
>
> Then describe your solution. The first sentence you've got up above is
> ok...
>
>> A replenishment queue has been added to keep track of all vcpus that
>> are runnable.
>>
> ...and this one too.
>
>> The following functions have major changes to manage the
>> replenishment
>> events and timer.
>>
>> repl_handler(): It is a timer handler which is re-programmed
>> to fire at the nearest vcpu deadline to replenish vcpus.
>>
>> rt_schedule(): picks the highest runnable vcpu based on cpu
>> affinity and ret.time will be passed to schedule(). If an idle
>> vcpu is picked, -1 is returned to avoid busy-waiting. repl_update()
>> has been removed.
>>
>> rt_vcpu_wake(): when a vcpu wakes up, it tickles instead of
>> picking one from the run queue.
>>
>> rt_context_saved(): when context switching is finished, the
>> preempted vcpu will be put back into the runq.
>>
> This is too detailed for a changelog. If you want this information to
> live somewhere (it already lives in the list archives, actually), make
> a cover letter (this is just one patch, so it's not required, but
> nothing forbids that). Or put it in a wiki page. Or write a blog post.
> Or (which would be kind of nice) all of them! :-)
>

I was thinking about this as well. The wiki will look a bit outdated for 
RTDS if this patch goes through. Meng: Do you think we should put extra 
information in the wiki? Someone's gotta update it sooner or later.

Thanks for the review Dario. I will put everything together soon.

Tianyang
Dario Faggioli March 16, 2016, 4:11 p.m. UTC | #9
On Wed, 2016-03-16 at 11:43 -0400, Chen, Tianyang wrote:
> 
> On 03/16/2016 10:25 AM, Dario Faggioli wrote:
> > 
> > This is too detailed for a changelog. If you want this information
> > to
> > live somewhere (it already lives in the list archives, actually),
> > make
> > a cover letter (this is just one patch, so it's not required, but
> > nothing forbids that). Or put it in a wiki page. Or write a blog
> > post.
> > Or (which would be kind of nice) all of them! :-)
> > 
> I was thinking about this as well. The wiki will look a bit outdated
> for 
> RTDS if this patch goes through. Meng: Do you think we should put
> extra 
> information in the wiki? Someone's gotta update it sooner or later.
> 
No need to ask. More information / more updated information on the wiki
is just __always__ a good thing.

So, feel free to go ahead and make it so. If neither of you has write
access, there is a procedure in the wiki itself to request it (or
just create a user and tell me, I think I should be able to
enable that for you).

Regards,
Dario
Meng Xu March 16, 2016, 8:45 p.m. UTC | #10
>>
>> sched_rt.c TIMER_SOFTIRQ
>>     replenishment_timer_handler()
>>         [spin_lock]
>>         <for_each_vcpu_on_q(i)> {
>>             replenish(i)
>>             runq_tickle(i)
>>         }>
>>         program_timer()
>>         [spin_lock]
>>
> And kill this (or, again, move to cover, wiki, blog, etc.) as well.
>
> Also, trying to apply this gave:
>
> /*
> <stdin>:194: trailing whitespace.
>  * A helper function that only removes a vcpu from a queue
> <stdin>:355: trailing whitespace.
>     /*
> <stdin>:356: trailing whitespace.
>      * The timer initialization will happen later when
> <stdin>:380: trailing whitespace.
>     {
> warning: squelched 10 whitespace errors
> warning: 15 lines add whitespace errors.
>
> And I think there are a couple of long lines as well.
Tianyang,

You can configure git to mark whitespace in color.
One approach can be found at
http://stackoverflow.com/questions/5257553/coloring-white-space-in-git-diffs-output

Thanks,

Meng
Meng Xu March 16, 2016, 8:51 p.m. UTC | #11
On Wed, Mar 16, 2016 at 10:44 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Wed, 2016-03-16 at 10:20 -0400, Meng Xu wrote:
>> As to the comment, I will suggest:
>>
>> /*
>>  * RTDS_was_depleted: Is a vcpus budget depleted?
>>
>>  * + Set in burn_budget() when a vcpus budget turns to zero
>>
>>  * + Checked and cleared in repl_handler() to replenish the budget
>>
>>  */
>>
>> What do you think?
>>
> Wow, we really are on the same page, I just suggested something quite
> similar to this.

Right. :-)

>
>> BTW, how about other parts of the patch? Is there something that you
>> don't like?
>>
> I just sent in my comments. I did have a few of them, I'm afraid.

Thank you very much! I saw it! I see the point. I didn't think too
much about the best wording for the comments.
Those suggestions make the comments more useful... Thank you very much! :-)

>
> This patch is doing some important and serious rework, of a piece of
> code which is not in the best possible shape, so I consider the above
> really really important.

Right! But those are easy to solve. :-)

>
>> I think the invalid budget returned in rt_schedule() in this patch is
>> a serious logical bug.
>>
> It's a bug. And you did a good job catching it. That's it. :-)
>
>> If all of my comments are solved, I think it is in a good state and
>> I'm considering to send the reviewed-by tag in the near future.
>>
> Sure!
>
> I don't expect it to take much longer for me to also be able to give
> green light. :-)

Great!

>
>> However, I won't send the reviewed-by until I get your confirmation,
>> since this will be the first reviewed-by I will be giving as
>> maintainer. I'd like to take a safe step. :-D
>>
> Mmm... That's not how it works, though. I know the feeling, and we've
> all been there, I guess. However, you see and are saying yourself
> already that such attitude from you can't really fly, in the long run.

Right! I understood...

>
> Do as you wish, but you must know that you lost the right of "taking
> safe steps" when you've become part of the MAINTAINERS file. :-P

OK... Then I will send the tag when I feel comfortable with the patch then.

>
> So, jokes aside, it's perfectly fine that, as soon as you're happy
> about the status of the patch, you send in your tag. In this specific
> case, even if you are a maintainer, that would not mean the patch could
> go in, because there are outstanding comments from another maintainer.

I see. Good to know that. :-)

>
> And yes, this means that co-maintainers are not entirely agreeing on
> the readiness of a particular patch, but it's not at all a big deal...
> It happens quite frequently, actually! :-)
>
> So, again, I understand the feeling... But it really reduces to "with
> great powers comes great responsibility". And just like in the movie
> (which I don't even like that much!! :-/), the sooner you accept that,
> the better. For you. For us. For the project. For the World. :-D :-D

Sure! OK... No problem. The worst case is that I have to "kill the
bug" that passed me. :-D

Thank you very much!

Best Regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
Meng Xu March 16, 2016, 8:54 p.m. UTC | #12
>> This is too detailed for a changelog. If you want this information to
>> live somewhere (it already lives in the list archives, actually), make
>> a cover letter (this is just one patch, so it's not required, but
>> nothing forbids that). Or put it in a wiki page. Or write a blog post.
>> Or (which would be kind of nice) all of them! :-)
>>
>
> I was thinking about this as well. The wiki will look a bit outdated for
> RTDS if this patch goes through. Meng: Do you think we should put extra
> information in the wiki? Someone's gotta update it sooner or later.
>
> Thanks for the review Dario. I will put everything together soon.

You can just register an account on the Xen wiki and update it directly...

It's always good to have more detailed correct information on the wiki.

Thanks and best regards,

Meng
Tianyang Chen March 17, 2016, 2:24 a.m. UTC | #13
On 03/16/2016 10:25 AM, Dario Faggioli wrote:
>> +        if ( curr_on_cpu(vc->processor) == vc &&
>> >+             ( !list_empty(runq) ) )
>> >
> So, this is because, since we don't keep the idle vcpus in the
> runqueues, we need to catch the case where v is running, but no other
> vcpu is waiting on the runqueue in runnable state, is this right?
>
> In any case, parentheses around the second part of the && seems
> pointless.
>

Yes. If there is no vcpu waiting on the run queue then there is no need 
to compare deadlines and tickle.

Tianyang

Patch

diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index bfed2e2..b491915 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -3,7 +3,9 @@ 
  * EDF scheduling is a real-time scheduling algorithm used in embedded field.
  *
  * by Sisu Xi, 2013, Washington University in Saint Louis
- * and Meng Xu, 2014, University of Pennsylvania
+ * Meng Xu, 2014-2016, University of Pennsylvania
+ * Tianyang Chen, 2016, University of Pennsylvania
+ * Dagaen Golomb, 2016, University of Pennsylvania
  *
  * based on the code of credit Scheduler
  */
@@ -16,6 +18,7 @@ 
 #include <xen/delay.h>
 #include <xen/event.h>
 #include <xen/time.h>
+#include <xen/timer.h>
 #include <xen/perfc.h>
 #include <xen/sched-if.h>
 #include <xen/softirq.h>
@@ -87,7 +90,7 @@ 
 #define RTDS_DEFAULT_BUDGET     (MICROSECS(4000))
 
 #define UPDATE_LIMIT_SHIFT      10
-#define MAX_SCHEDULE            (MILLISECS(1))
+
 /*
  * Flags
  */
@@ -115,6 +118,18 @@ 
 #define RTDS_delayed_runq_add (1<<__RTDS_delayed_runq_add)
 
 /*
+ * The replenishment timer handler needs to check this bit
+ * to know where a replenished vcpu was, when deciding which
+ * vcpu should tickle.
+ * A replenished vcpu should tickle if it was moved from the
+ * depleted queue to the run queue.
+ * + Set in burn_budget() if a vcpu has zero budget left.
+ * + Cleared and checked in the repenishment handler.
+ */
+#define __RTDS_was_depleted     3
+#define RTDS_was_depleted (1<<__RTDS_was_depleted)
+
+/*
  * rt tracing events ("only" 512 available!). Check
  * include/public/trace.h for more details.
  */
@@ -142,6 +157,9 @@  static cpumask_var_t *_cpumask_scratch;
  */
 static unsigned int nr_rt_ops;
 
+/* handler for the replenishment timer */
+static void repl_handler(void *data);
+
 /*
  * Systme-wide private data, include global RunQueue/DepletedQ
  * Global lock is referenced by schedule_data.schedule_lock from all
@@ -152,7 +170,9 @@  struct rt_private {
     struct list_head sdom;      /* list of availalbe domains, used for dump */
     struct list_head runq;      /* ordered list of runnable vcpus */
     struct list_head depletedq; /* unordered list of depleted vcpus */
+    struct list_head replq;     /* ordered list of vcpus that need replenishment */
     cpumask_t tickled;          /* cpus been tickled */
+    struct timer *repl_timer;   /* replenishment timer */
 };
 
 /*
@@ -160,6 +180,7 @@  struct rt_private {
  */
 struct rt_vcpu {
     struct list_head q_elem;    /* on the runq/depletedq list */
+    struct list_head replq_elem; /* on the repl event list */
 
     /* Up-pointers */
     struct rt_dom *sdom;
@@ -213,8 +234,14 @@  static inline struct list_head *rt_depletedq(const struct scheduler *ops)
     return &rt_priv(ops)->depletedq;
 }
 
+static inline struct list_head *rt_replq(const struct scheduler *ops)
+{
+    return &rt_priv(ops)->replq;
+}
+
 /*
- * Queue helper functions for runq and depletedq
+ * Helper functions for manipulating the runqueue, the depleted queue,
+ * and the replenishment events queue.
  */
 static int
 __vcpu_on_q(const struct rt_vcpu *svc)
@@ -228,6 +255,18 @@  __q_elem(struct list_head *elem)
     return list_entry(elem, struct rt_vcpu, q_elem);
 }
 
+static struct rt_vcpu *
+replq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, replq_elem);
+}
+
+static int
+vcpu_on_replq(const struct rt_vcpu *svc)
+{
+    return !list_empty(&svc->replq_elem);
+}
+
 /*
  * Debug related code, dump vcpu/cpu information
  */
@@ -288,7 +327,7 @@  rt_dump_pcpu(const struct scheduler *ops, int cpu)
 static void
 rt_dump(const struct scheduler *ops)
 {
-    struct list_head *runq, *depletedq, *iter;
+    struct list_head *runq, *depletedq, *replq, *iter;
     struct rt_private *prv = rt_priv(ops);
     struct rt_vcpu *svc;
     struct rt_dom *sdom;
@@ -301,6 +340,7 @@  rt_dump(const struct scheduler *ops)
 
     runq = rt_runq(ops);
     depletedq = rt_depletedq(ops);
+    replq = rt_replq(ops);
 
     printk("Global RunQueue info:\n");
     list_for_each( iter, runq )
@@ -316,6 +356,13 @@  rt_dump(const struct scheduler *ops)
         rt_dump_vcpu(ops, svc);
     }
 
+    printk("Global Replenishment Event info:\n");
+    list_for_each ( iter, replq )
+    {
+        svc = replq_elem(iter);
+        rt_dump_vcpu(ops, svc);
+    }
+
     printk("Domain info:\n");
     list_for_each( iter, &prv->sdom )
     {
@@ -380,11 +427,78 @@  rt_update_deadline(s_time_t now, struct rt_vcpu *svc)
     return;
 }
 
+/*
+ * A helper function that only removes a vcpu from a queue;
+ * it returns 1 if the vcpu was at the front of the list.
+ */
+static inline int
+deadline_queue_remove(struct list_head *queue, struct list_head *elem)
+{
+    int pos = 0;
+
+    if ( queue->next != elem )
+        pos = 1;
+
+    list_del_init(elem);
+    return !pos;
+}
+
 static inline void
 __q_remove(struct rt_vcpu *svc)
 {
-    if ( __vcpu_on_q(svc) )
-        list_del_init(&svc->q_elem);
+    ASSERT( __vcpu_on_q(svc) );
+    list_del_init(&svc->q_elem);
+}
+
+/*
+ * Removing a vcpu from the replenishment queue may
+ * re-program the timer for the next replenishment event,
+ * if the vcpu was at the front of the list.
+ */
+static inline void
+__replq_remove(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct timer* repl_timer = prv->repl_timer;
+
+    ASSERT( vcpu_on_replq(svc) );
+
+    if ( deadline_queue_remove(replq, &svc->replq_elem) )
+    {
+        /* re-arm the timer for the next replenishment event */
+        if ( !list_empty(replq) )
+        {
+            struct rt_vcpu *svc_next = replq_elem(replq->next);
+            set_timer(repl_timer, svc_next->cur_deadline);
+        }
+        else
+            stop_timer(repl_timer);
+    }
+}
+
+/*
+ * A utility function that inserts a vcpu into a
+ * queue in deadline (EDF) order. The returned
+ * value is 1 if the vcpu has been inserted at the
+ * front of the list.
+ */
+static int
+deadline_queue_insert(struct rt_vcpu * (*_get_q_elem)(struct list_head *elem),
+    struct rt_vcpu *svc, struct list_head *elem, struct list_head *queue)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    list_for_each ( iter, queue )
+    {
+        struct rt_vcpu * iter_svc = (*_get_q_elem)(iter);
+        if ( svc->cur_deadline <= iter_svc->cur_deadline )
+            break;
+        pos++;
+    }
+    list_add_tail(elem, iter);
+    return !pos;
 }
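The two queue helpers are small, but their return-value convention (1 when the head of the queue changed) is what drives all of the timer re-programming below. Here is a self-contained sketch of the same logic, using a simplified stand-in for Xen's intrusive `list_head`; the `vcpu_stub` type and `elem_to_vcpu` macro are illustrative, not Xen's:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive doubly-linked list, modelled on Xen's list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Insert elem immediately before 'before'. */
static void list_add_tail(struct list_head *elem, struct list_head *before)
{
    elem->prev = before->prev;
    elem->next = before;
    before->prev->next = elem;
    before->prev = elem;
}

static void list_del_init(struct list_head *elem)
{
    elem->prev->next = elem->next;
    elem->next->prev = elem->prev;
    list_init(elem);
}

/* Illustrative stand-in for struct rt_vcpu. */
struct vcpu_stub {
    long cur_deadline;
    struct list_head elem;
};

#define elem_to_vcpu(ptr) \
    ((struct vcpu_stub *)((char *)(ptr) - offsetof(struct vcpu_stub, elem)))

/* Insert in non-decreasing deadline (EDF) order;
 * return 1 iff svc landed at the front of the queue. */
static int deadline_queue_insert(struct vcpu_stub *svc, struct list_head *queue)
{
    struct list_head *iter;
    int pos = 0;

    for ( iter = queue->next; iter != queue; iter = iter->next )
    {
        if ( svc->cur_deadline <= elem_to_vcpu(iter)->cur_deadline )
            break;
        pos++;
    }
    list_add_tail(&svc->elem, iter);
    return !pos;
}

/* Remove; return 1 iff the element was at the front of the queue. */
static int deadline_queue_remove(struct list_head *queue, struct list_head *elem)
{
    int was_front = (queue->next == elem);

    list_del_init(elem);
    return was_front;
}
```

Inserting deadlines 30, 10, 20 in that order yields the list [10, 20, 30], with only the first two insertions reporting "new front".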
 
 /*
@@ -397,27 +511,62 @@  __runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
 {
     struct rt_private *prv = rt_priv(ops);
     struct list_head *runq = rt_runq(ops);
-    struct list_head *iter;
 
     ASSERT( spin_is_locked(&prv->lock) );
-
     ASSERT( !__vcpu_on_q(svc) );
+    ASSERT( vcpu_on_replq(svc) );
 
     /* add svc to runq if svc still has budget */
     if ( svc->cur_budget > 0 )
-    {
-        list_for_each(iter, runq)
-        {
-            struct rt_vcpu * iter_svc = __q_elem(iter);
-            if ( svc->cur_deadline <= iter_svc->cur_deadline )
-                    break;
-         }
-        list_add_tail(&svc->q_elem, iter);
-    }
+        deadline_queue_insert(&__q_elem, svc, &svc->q_elem, runq);
     else
-    {
         list_add(&svc->q_elem, &prv->depletedq);
+}
+
+/*
+ * Insert svc into the replenishment event list
+ * in replenishment time order.
+ * Vcpus that need to be replenished earlier go first.
+ * The timer may be re-programmed if svc is inserted
+ * at the front of the event list.
+ */
+static void
+__replq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct list_head *replq = rt_replq(ops);
+    struct rt_private *prv = rt_priv(ops);
+    struct timer *repl_timer = prv->repl_timer;
+
+    ASSERT( !vcpu_on_replq(svc) );
+
+    if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
+        set_timer(repl_timer, svc->cur_deadline);
+}
+
+/*
+ * Removes and re-inserts an event in the replenishment queue,
+ * re-programming the timer if the head of the queue changes.
+ */
+static void
+replq_reinsert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct list_head *replq = rt_replq(ops);
+    struct rt_private *prv = rt_priv(ops);
+    struct timer *repl_timer = prv->repl_timer;
+
+    ASSERT( vcpu_on_replq(svc) );
+
+    if ( deadline_queue_remove(replq, &svc->replq_elem) )
+    {
+        if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
+            set_timer(repl_timer, svc->cur_deadline);
+        else
+        {
+            struct rt_vcpu *svc_next = replq_elem(replq->next);
+            set_timer(repl_timer, svc_next->cur_deadline);
+        }
     }
+    else if ( deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq) )
+        set_timer(repl_timer, svc->cur_deadline);
 }
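Note that `replq_reinsert()` only ever touches the timer when the head of the replenishment queue changes. That decision can be captured by a hypothetical pure helper (the function name and the `-1` "leave the timer alone" convention are illustrative, not part of the patch):

```c
#include <assert.h>

/*
 * Given whether the reinserted vcpu was at the head before removal
 * (was_front), whether it landed at the head after reinsertion
 * (now_front), its new deadline, and the deadline of the queue's new
 * head, return the value the one-shot replenishment timer should be
 * armed with, or -1 when the timer does not need re-programming.
 */
static long next_timer_arm(int was_front, int now_front,
                           long svc_deadline, long head_deadline)
{
    if ( now_front )
        return svc_deadline;  /* svc is the new earliest event */
    if ( was_front )
        return head_deadline; /* old earliest moved back; track new head */
    return -1;                /* middle of the queue: timer unchanged */
}
```

Only the first and last cases of `replq_reinsert()` actually re-arm with `svc->cur_deadline`; the middle case tracks whichever vcpu becomes the new head.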
 
 /*
@@ -449,11 +598,18 @@  rt_init(struct scheduler *ops)
     INIT_LIST_HEAD(&prv->sdom);
     INIT_LIST_HEAD(&prv->runq);
     INIT_LIST_HEAD(&prv->depletedq);
+    INIT_LIST_HEAD(&prv->replq);
 
     cpumask_clear(&prv->tickled);
 
     ops->sched_data = prv;
 
+    /*
+     * The timer initialization will happen later when
+     * the first pcpu is added to this pool in alloc_pdata.
+     */
+    prv->repl_timer = NULL;
+
     return 0;
 
  no_mem:
@@ -473,6 +629,10 @@  rt_deinit(struct scheduler *ops)
         xfree(_cpumask_scratch);
         _cpumask_scratch = NULL;
     }
+
+    kill_timer(prv->repl_timer);
+    xfree(prv->repl_timer);
+
     ops->sched_data = NULL;
     xfree(prv);
 }
@@ -494,6 +654,17 @@  rt_alloc_pdata(const struct scheduler *ops, int cpu)
     if ( !alloc_cpumask_var(&_cpumask_scratch[cpu]) )
         return NULL;
 
+    if ( prv->repl_timer == NULL )
+    {
+        /* Allocate the timer on the first cpu of this pool. */
+        prv->repl_timer = xzalloc(struct timer);
+
+        if ( prv->repl_timer == NULL )
+            return NULL;
+
+        init_timer(prv->repl_timer, repl_handler, (void *)ops, cpu);
+    }
+
     /* 1 indicates alloc. succeed in schedule.c */
     return (void *)1;
 }
@@ -587,6 +758,7 @@  rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
         return NULL;
 
     INIT_LIST_HEAD(&svc->q_elem);
+    INIT_LIST_HEAD(&svc->replq_elem);
     svc->flags = 0U;
     svc->sdom = dd;
     svc->vcpu = vc;
@@ -610,7 +782,8 @@  rt_free_vdata(const struct scheduler *ops, void *priv)
 }
 
 /*
- * This function is called in sched_move_domain() in schedule.c
+ * It is called in sched_move_domain() and sched_init_vcpu()
+ * in schedule.c.
  * When move a domain to a new cpupool.
  * It inserts vcpus of moving domain to the scheduler's RunQ in
  * dest. cpupool.
@@ -628,8 +801,13 @@  rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     if ( now >= svc->cur_deadline )
         rt_update_deadline(now, svc);
 
-    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) && !vc->is_running )
-        __runq_insert(ops, svc);
+    if ( !__vcpu_on_q(svc) && vcpu_runnable(vc) )
+    {
+        __replq_insert(ops, svc);
+
+        if ( !vc->is_running )
+            __runq_insert(ops, svc);
+    }
     vcpu_schedule_unlock_irq(lock, vc);
 
     SCHED_STAT_CRANK(vcpu_insert);
@@ -652,6 +830,10 @@  rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
     if ( __vcpu_on_q(svc) )
         __q_remove(svc);
+
+    if ( vcpu_on_replq(svc) )
+        __replq_remove(ops,svc);
+
     vcpu_schedule_unlock_irq(lock, vc);
 }
 
@@ -706,8 +888,15 @@  burn_budget(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
 
     svc->cur_budget -= delta;
 
-    if ( svc->cur_budget < 0 )
+    if ( svc->cur_budget <= 0 )
+    {
         svc->cur_budget = 0;
+        /*
+         * Set __RTDS_was_depleted so the replenishment
+         * handler knows it should tickle this vcpu once
+         * its budget has been refilled.
+         */
+        set_bit(__RTDS_was_depleted, &svc->flags);
+    }
 
     /* TRACE */
     {
@@ -784,44 +973,6 @@  __runq_pick(const struct scheduler *ops, const cpumask_t *mask)
 }
 
 /*
- * Update vcpu's budget and
- * sort runq by insert the modifed vcpu back to runq
- * lock is grabbed before calling this function
- */
-static void
-__repl_update(const struct scheduler *ops, s_time_t now)
-{
-    struct list_head *runq = rt_runq(ops);
-    struct list_head *depletedq = rt_depletedq(ops);
-    struct list_head *iter;
-    struct list_head *tmp;
-    struct rt_vcpu *svc = NULL;
-
-    list_for_each_safe(iter, tmp, runq)
-    {
-        svc = __q_elem(iter);
-        if ( now < svc->cur_deadline )
-            break;
-
-        rt_update_deadline(now, svc);
-        /* reinsert the vcpu if its deadline is updated */
-        __q_remove(svc);
-        __runq_insert(ops, svc);
-    }
-
-    list_for_each_safe(iter, tmp, depletedq)
-    {
-        svc = __q_elem(iter);
-        if ( now >= svc->cur_deadline )
-        {
-            rt_update_deadline(now, svc);
-            __q_remove(svc); /* remove from depleted queue */
-            __runq_insert(ops, svc); /* add to runq */
-        }
-    }
-}
-
-/*
  * schedule function for rt scheduler.
  * The lock is already grabbed in schedule.c, no need to lock here
  */
@@ -840,8 +991,6 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
     /* burn_budget would return for IDLE VCPU */
     burn_budget(ops, scurr, now);
 
-    __repl_update(ops, now);
-
     if ( tasklet_work_scheduled )
     {
         trace_var(TRC_RTDS_SCHED_TASKLET, 1, 0,  NULL);
@@ -868,6 +1017,7 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
         set_bit(__RTDS_delayed_runq_add, &scurr->flags);
 
     snext->last_start = now;
+    ret.time = -1; /* if an idle vcpu is picked */
     if ( !is_idle_vcpu(snext->vcpu) )
     {
         if ( snext != scurr )
@@ -880,9 +1030,8 @@  rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_sched
             snext->vcpu->processor = cpu;
             ret.migrated = 1;
         }
+        ret.time = snext->cur_budget; /* invoke the scheduler next time */
     }
-
-    ret.time = MIN(snext->budget, MAX_SCHEDULE); /* sched quantum */
     ret.task = snext->vcpu;
 
     return ret;
@@ -903,7 +1052,10 @@  rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
     if ( curr_on_cpu(vc->processor) == vc )
         cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
     else if ( __vcpu_on_q(svc) )
+    {
         __q_remove(svc);
+        __replq_remove(ops, svc);
+    }
     else if ( svc->flags & RTDS_delayed_runq_add )
         clear_bit(__RTDS_delayed_runq_add, &svc->flags);
 }
@@ -1008,10 +1160,7 @@  rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 {
     struct rt_vcpu * const svc = rt_vcpu(vc);
     s_time_t now = NOW();
-    struct rt_private *prv = rt_priv(ops);
-    struct rt_vcpu *snext = NULL; /* highest priority on RunQ */
-    struct rt_dom *sdom = NULL;
-    cpumask_t *online;
+    bool_t missed;
 
     BUG_ON( is_idle_vcpu(vc) );
 
@@ -1033,32 +1182,42 @@  rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     else
         SCHED_STAT_CRANK(vcpu_wake_not_runnable);
 
-    /* If context hasn't been saved for this vcpu yet, we can't put it on
-     * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
-     * put on the Runqueue/DepletedQ after the context has been saved.
+    /*
+     * If a deadline passed while svc was asleep/blocked, we need new
+     * scheduling parameters (a new deadline and full budget).
+     */
+
+    missed = ( now >= svc->cur_deadline );
+    if ( missed )
+        rt_update_deadline(now, svc);
+
+    /*
+     * If context hasn't been saved for this vcpu yet, we can't put it on
+     * the run-queue/depleted-queue. Instead, we set the appropriate flag,
+     * the vcpu will be put back on queue after the context has been saved
+     * (in rt_context_save()).
      */
     if ( unlikely(svc->flags & RTDS_scheduled) )
     {
         set_bit(__RTDS_delayed_runq_add, &svc->flags);
+        /*
+         * The vcpu is waking up already, and we didn't even have the time to
+         * remove its next replenishment event from the replenishment queue
+         * when it blocked! No big deal. If we did not miss the deadline in
+         * the meantime, let's just leave it there. If we did, let's remove it
+         * and queue a new one (to occur at our new deadline).
+         */
+        if ( missed )
+           replq_reinsert(ops, svc);
         return;
     }
 
-    if ( now >= svc->cur_deadline)
-        rt_update_deadline(now, svc);
-
+    /* Replenishment event got cancelled when we blocked. Add it back. */
+    __replq_insert(ops, svc);
     /* insert svc to runq/depletedq because svc is not in queue now */
     __runq_insert(ops, svc);
 
-    __repl_update(ops, now);
-
-    ASSERT(!list_empty(&prv->sdom));
-    sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-    online = cpupool_domain_cpumask(sdom->dom);
-    snext = __runq_pick(ops, online); /* pick snext from ALL valid cpus */
-
-    runq_tickle(ops, snext);
-
-    return;
+    runq_tickle(ops, svc);
 }
 
 /*
@@ -1069,10 +1228,6 @@  static void
 rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
 {
     struct rt_vcpu *svc = rt_vcpu(vc);
-    struct rt_vcpu *snext = NULL;
-    struct rt_dom *sdom = NULL;
-    struct rt_private *prv = rt_priv(ops);
-    cpumask_t *online;
     spinlock_t *lock = vcpu_schedule_lock_irq(vc);
 
     clear_bit(__RTDS_scheduled, &svc->flags);
@@ -1084,15 +1239,11 @@  rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
          likely(vcpu_runnable(vc)) )
     {
         __runq_insert(ops, svc);
-        __repl_update(ops, NOW());
-
-        ASSERT(!list_empty(&prv->sdom));
-        sdom = list_entry(prv->sdom.next, struct rt_dom, sdom_elem);
-        online = cpupool_domain_cpumask(sdom->dom);
-        snext = __runq_pick(ops, online); /* pick snext from ALL cpus */
-
-        runq_tickle(ops, snext);
+        runq_tickle(ops, svc);
     }
+    else
+        __replq_remove(ops, svc);
+
 out:
     vcpu_schedule_unlock_irq(lock, vc);
 }
@@ -1150,6 +1301,100 @@  rt_dom_cntl(
     return rc;
 }
 
+/*
+ * The replenishment timer handler picks vcpus
+ * from the replq and does the actual replenishment.
+ */
+static void repl_handler(void *data)
+{
+    s_time_t now = NOW();
+    struct scheduler *ops = data;
+    struct rt_private *prv = rt_priv(ops);
+    struct list_head *replq = rt_replq(ops);
+    struct list_head *runq = rt_runq(ops);
+    struct timer *repl_timer = prv->repl_timer;
+    struct list_head *iter, *tmp;
+    struct list_head tmp_replq;
+    struct rt_vcpu *svc = NULL;
+
+    spin_lock_irq(&prv->lock);
+
+    stop_timer(repl_timer);
+
+    /*
+     * A temporary queue for the vcpus that have just been
+     * replenished; they are tickled from this list below.
+     */
+    INIT_LIST_HEAD(&tmp_replq);
+
+    list_for_each_safe ( iter, tmp, replq )
+    {
+        svc = replq_elem(iter);
+
+        if ( now < svc->cur_deadline )
+            break;
+
+        rt_update_deadline(now, svc);
+        /*
+         * When a vcpu is replenished, it is moved to the
+         * temporary queue, to be considered for tickling.
+         */
+        list_del(&svc->replq_elem);
+        list_add(&svc->replq_elem, &tmp_replq);
+
+        /*
+         * If svc is on the run queue, it needs to be
+         * re-inserted at the position its updated
+         * deadline dictates.
+         */
+        if ( __vcpu_on_q(svc) )
+        {
+            /* put back to runq */
+            __q_remove(svc);
+            __runq_insert(ops, svc);
+        }
+    }
+
+    /* Iterate through the list of updated vcpus. */
+    list_for_each_safe ( iter, tmp, &tmp_replq )
+    {
+        struct vcpu* vc;
+        svc = replq_elem(iter);
+        vc = svc->vcpu;
+
+        if ( curr_on_cpu(vc->processor) == vc &&
+             !list_empty(runq) )
+        {
+            /*
+             * If the vcpu is running, tickle with the head
+             * of the run queue if the head has a higher
+             * priority (an earlier deadline).
+             */
+            struct rt_vcpu *next_on_runq = __q_elem(runq->next);
+            if ( svc->cur_deadline > next_on_runq->cur_deadline )
+                runq_tickle(ops, next_on_runq);
+        }
+        else if ( __vcpu_on_q(svc) )
+        {
+            /* Only need to tickle a vcpu that was depleted. */
+            if ( test_and_clear_bit(__RTDS_was_depleted, &svc->flags) )
+                runq_tickle(ops, svc);
+        }
+
+        list_del(&svc->replq_elem);
+        /* Move it back to the replenishment event queue. */
+        deadline_queue_insert(&replq_elem, svc, &svc->replq_elem, replq);
+    }
+
+    /*
+     * Re-arm the timer with the earliest event on the
+     * replenishment queue; if the queue is empty, leave
+     * the timer stopped.
+     */
+    if ( !list_empty(replq) )
+        set_timer(repl_timer, replq_elem(replq->next)->cur_deadline);
+
+    spin_unlock_irq(&prv->lock);
+}
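The handler's overall shape — stop the timer, replenish everything whose deadline has passed, re-arm for the earliest remaining event — can be modelled with plain arrays standing in for Xen's queues. All names here are illustrative; the real handler walks only the due prefix of the ordered replq and also tickles pcpus:

```c
#include <assert.h>

#define NVCPUS 3

/* Illustrative stand-in for struct rt_vcpu's budget/deadline fields. */
struct toy_vcpu {
    long deadline;    /* absolute time of the next replenishment */
    long period;
    long budget;      /* remaining budget (cur_budget in the patch) */
    long full_budget; /* per-period budget parameter */
};

/*
 * Toy model of the replenishment handler: for every vcpu whose deadline
 * has passed, push the deadline beyond 'now' and refill the budget
 * (mirroring rt_update_deadline()), then return the deadline the
 * one-shot timer should be re-armed with, or -1 when nothing is pending.
 */
static long toy_repl_handler(struct toy_vcpu v[], int n, long now)
{
    long next = -1;
    int i;

    for ( i = 0; i < n; i++ )
    {
        while ( v[i].deadline <= now )
        {
            v[i].deadline += v[i].period;
            v[i].budget = v[i].full_budget;
        }
        if ( next == -1 || v[i].deadline < next )
            next = v[i].deadline;
    }
    return next;
}
```

With deadlines {10, 15, 40} and now = 16, the first two vcpus are replenished (their deadlines become 20 and 35), the third is untouched, and the timer is re-armed for time 20, the new earliest event.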
+
 static const struct scheduler sched_rtds_def = {
     .name           = "SMP RTDS Scheduler",
     .opt_name       = "rtds",