[v3,3/3] xen/sched: fix cpu hotplug

Message ID 20220816101317.23014-4-jgross@suse.com
State Superseded
Series xen/sched: fix cpu hotplug

Commit Message

Jürgen Groß Aug. 16, 2022, 10:13 a.m. UTC
Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
with interrupts disabled, thus any memory allocation or freeing must
be avoided.

Since commit 5047cd1d5dea ("xen/common: Use enhanced
ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
via an assertion, which will now fail.

Before that commit cpu unplugging in normal configurations was working
just by chance as only the cpu performing schedule_cpu_rm() was doing
active work. With core scheduling enabled, however, failures could
result from memory allocations not being properly propagated to other
cpus' TLBs.

Fix this mess by allocating needed memory before entering
stop_machine_run() and freeing any memory only after having finished
stop_machine_run().

Fixes: 1ec410112cdd ("xen/sched: support differing granularity in schedule_cpu_[add/rm]()")
Reported-by: Gao Ruifeng <ruifeng.gao@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V2:
- move affinity mask allocation into schedule_cpu_rm_alloc() (Jan Beulich)
---
 xen/common/sched/core.c    | 27 +++++++++++----
 xen/common/sched/cpupool.c | 68 +++++++++++++++++++++++++++++---------
 xen/common/sched/private.h |  5 ++-
 3 files changed, 78 insertions(+), 22 deletions(-)
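
The shape of the fix, condensed from the cpu_callback() changes in the full patch below (system_state checks, cpupool_cpu_remove_prologue() and error paths omitted; not a drop-in excerpt): memory is allocated in CPU_DOWN_PREPARE while interrupts are still enabled, only consumed during CPU_DYING inside stop_machine_run(), and released again in CPU_DEAD or CPU_DOWN_FAILED.

static struct cpu_rm_data *mem;        /* protected by cpu_add_remove_lock */

switch ( action )
{
case CPU_DOWN_PREPARE:                 /* IRQs on: allocating is fine */
    ASSERT(!mem);
    mem = schedule_cpu_rm_alloc(cpu, true);
    rc = mem ? 0 : -ENOMEM;
    break;

case CPU_DYING:                        /* inside stop_machine_run(), IRQs off */
    ASSERT(mem);
    cpupool_cpu_remove(cpu, mem);      /* uses only the pre-allocated memory */
    break;

case CPU_DOWN_FAILED:                  /* cpu comes back: free and re-add */
    if ( mem )
    {
        schedule_cpu_rm_free(mem, cpu);
        mem = NULL;
    }
    rc = cpupool_cpu_add(cpu);
    break;

case CPU_DEAD:                         /* IRQs on again: safe to free */
    ASSERT(mem);
    schedule_cpu_rm_free(mem, cpu);
    mem = NULL;
    break;
}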

Comments

Andrew Cooper Aug. 31, 2022, 10:52 p.m. UTC | #1
On 16/08/2022 11:13, Juergen Gross wrote:
> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()

Cpu cpu.

> with interrupts disabled, thus any memory allocation or freeing must
> be avoided.
>
> Since commit 5047cd1d5dea ("xen/common: Use enhanced
> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
> via an assertion, which will now fail.
>
> Before that commit cpu unplugging in normal configurations was working
> just by chance as only the cpu performing schedule_cpu_rm() was doing
> active work. With core scheduling enabled, however, failures could
> result from memory allocations not being properly propagated to other
> cpus' TLBs.

This isn't accurate, is it?  The problem with initiating a TLB flush
with IRQs disabled is that you can deadlock against a remote CPU which
is waiting for you to enable IRQs first to take a TLB flush IPI.
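
One way that scenario plays out, in purely illustrative pseudocode (send_flush_ipi() is an invented name, not Xen's flush API):

/*
 *   CPU A (IRQs masked)                 CPU B
 *   -------------------------------     -------------------------------
 *   spins waiting on CPU B, e.g. in     send_flush_ipi(cpumask_of(A));
 *   a stop_machine rendezvous or for    spins until A acknowledges the
 *   its own IPI to complete             flush -- which A cannot do
 *                                       while its IRQs stay masked
 *
 * Each CPU needs the other to make progress first, so neither does.
 */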

How does a memory allocation out of the xenheap result in a TLB flush? 
Even with split heaps, you're only potentially allocating into a new
slot which was unused...

> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 228470ac41..ffb2d6202b 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -3260,6 +3260,17 @@ static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
>      if ( !data )
>          goto out;
>  
> +    if ( aff_alloc )
> +    {
> +        if ( !update_node_aff_alloc(&data->affinity) )

I spent ages trying to figure out what this was doing, before realising
the problem is the function name.

alloc (as with free) is the critical piece of information and needs to
come first.  The fact we typically pass the result to
update_node_aff(inity) isn't relevant, and becomes actively wrong here
when we're nowhere near.

Patch 1 needs to name these helpers:

bool alloc_affinity_masks(struct affinity_masks *affinity);
void free_affinity_masks(struct affinity_masks *affinity);

and then patches 2 and 3 become far easier to follow.

Similarly in patch 2, the new helpers need to be
{alloc,free}_cpu_rm_data() to make sense.  These have nothing to do with
scheduling.

Also, you shouldn't introduce the helpers static in patch 2 and then
turn them non-static in patch 3.  That just adds unnecessary churn to
the complicated patch.

> +        {
> +            XFREE(data);
> +            goto out;
> +        }
> +    }
> +    else
> +        memset(&data->affinity, 0, sizeof(data->affinity));

I honestly don't think it is worth optimising xzalloc() -> xmalloc() 
for the cognitive complexity of having this logic here.

> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
> index 58e082eb4c..2506861e4f 100644
> --- a/xen/common/sched/cpupool.c
> +++ b/xen/common/sched/cpupool.c
> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d, struct cpupool *c)
>  }
>  
>  /* Update affinities of all domains in a cpupool. */
> -static void cpupool_update_node_affinity(const struct cpupool *c)
> +static void cpupool_update_node_affinity(const struct cpupool *c,
> +                                         struct affinity_masks *masks)
>  {
> -    struct affinity_masks masks;
> +    struct affinity_masks local_masks;
>      struct domain *d;
>  
> -    if ( !update_node_aff_alloc(&masks) )
> -        return;
> +    if ( !masks )
> +    {
> +        if ( !update_node_aff_alloc(&local_masks) )
> +            return;
> +        masks = &local_masks;
> +    }
>  
>      rcu_read_lock(&domlist_read_lock);
>  
>      for_each_domain_in_cpupool(d, c)
> -        domain_update_node_aff(d, &masks);
> +        domain_update_node_aff(d, masks);
>  
>      rcu_read_unlock(&domlist_read_lock);
>  
> -    update_node_aff_free(&masks);
> +    if ( masks == &local_masks )
> +        update_node_aff_free(masks);
>  }
>  
>  /*

Why do we need this at all?  domain_update_node_aff() already knows what
to do when passed NULL, so this seems like an awfully complicated no-op.

> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>  {
>      unsigned int cpu = (unsigned long)hcpu;
>      int rc = 0;
> +    static struct cpu_rm_data *mem;
>  
>      switch ( action )
>      {
>      case CPU_DOWN_FAILED:
> +        if ( system_state <= SYS_STATE_active )
> +        {
> +            if ( mem )
> +            {

So, this does compile (and indeed I've tested the result), but I can't
see how it should.

mem is guaranteed to be uninitialised at this point, and ...

> +                schedule_cpu_rm_free(mem, cpu);
> +                mem = NULL;
> +            }
> +            rc = cpupool_cpu_add(cpu);
> +        }
> +        break;
>      case CPU_ONLINE:
>          if ( system_state <= SYS_STATE_active )
>              rc = cpupool_cpu_add(cpu);
> @@ -1019,12 +1038,31 @@ static int cf_check cpu_callback(
>      case CPU_DOWN_PREPARE:
>          /* Suspend/Resume don't change assignments of cpus to cpupools. */
>          if ( system_state <= SYS_STATE_active )
> +        {
>              rc = cpupool_cpu_remove_prologue(cpu);
> +            if ( !rc )
> +            {
> +                ASSERT(!mem);

... here, and each subsequent assertion too.

Given that I tested the patch and it does fix the IRQ assertion, I can
only imagine that it works by deterministically finding stack rubble
which happens to be 0.

~Andrew
Jürgen Groß Sept. 1, 2022, 6:11 a.m. UTC | #2
On 01.09.22 00:52, Andrew Cooper wrote:
> On 16/08/2022 11:13, Juergen Gross wrote:
>> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
> 
> Cpu cpu.
> 
>> with interrupts disabled, thus any memory allocation or freeing must
>> be avoided.
>>
>> Since commit 5047cd1d5dea ("xen/common: Use enhanced
>> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
>> via an assertion, which will now fail.
>>
>> Before that commit cpu unplugging in normal configurations was working
>> just by chance as only the cpu performing schedule_cpu_rm() was doing
>> active work. With core scheduling enabled, however, failures could
>> result from memory allocations not being properly propagated to other
>> cpus' TLBs.
> 
> This isn't accurate, is it?  The problem with initiating a TLB flush
> with IRQs disabled is that you can deadlock against a remote CPU which
> is waiting for you to enable IRQs first to take a TLB flush IPI.

As long as only one cpu is trying to allocate/free memory during the
stop_machine_run() action the deadlock won't happen.

> How does a memory allocation out of the xenheap result in a TLB flush?
> Even with split heaps, you're only potentially allocating into a new
> slot which was unused...

Yeah, you are right. The main problem would occur only when a virtual
address is changed to point at another physical address, which should be
quite unlikely.

I can drop that paragraph, as it doesn't really help.

> 
>> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
>> index 228470ac41..ffb2d6202b 100644
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -3260,6 +3260,17 @@ static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
>>       if ( !data )
>>           goto out;
>>   
>> +    if ( aff_alloc )
>> +    {
>> +        if ( !update_node_aff_alloc(&data->affinity) )
> 
> I spent ages trying to figure out what this was doing, before realising
> the problem is the function name.
> 
> alloc (as with free) is the critical piece of information and needs to
> come first.  The fact we typically pass the result to
> update_node_aff(inity) isn't relevant, and becomes actively wrong here
> when we're nowhere near.
> 
> Patch 1 needs to name these helpers:
> 
> bool alloc_affinity_masks(struct affinity_masks *affinity);
> void free_affinity_masks(struct affinity_masks *affinity);
> 
> and then patches 2 and 3 become far easier to follow.
> 
> Similarly in patch 2, the new helpers need to be
> {alloc,free}_cpu_rm_data() to make sense.  These have nothing to do with
> scheduling.
> 
> Also, you shouldn't introduce the helpers static in patch 2 and then
> turn them non-static in patch 3.  That just adds unnecessary churn to
> the complicated patch.

Okay to all of above.

> 
>> +        {
>> +            XFREE(data);
>> +            goto out;
>> +        }
>> +    }
>> +    else
>> +        memset(&data->affinity, 0, sizeof(data->affinity));
> 
> I honestly don't think it is worth optimising xzalloc() -> xmalloc()
> for the cognitive complexity of having this logic here.

I don't mind either way. This logic is the result of one of Jan's comments.

> 
>> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
>> index 58e082eb4c..2506861e4f 100644
>> --- a/xen/common/sched/cpupool.c
>> +++ b/xen/common/sched/cpupool.c
>> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d, struct cpupool *c)
>>   }
>>   
>>   /* Update affinities of all domains in a cpupool. */
>> -static void cpupool_update_node_affinity(const struct cpupool *c)
>> +static void cpupool_update_node_affinity(const struct cpupool *c,
>> +                                         struct affinity_masks *masks)
>>   {
>> -    struct affinity_masks masks;
>> +    struct affinity_masks local_masks;
>>       struct domain *d;
>>   
>> -    if ( !update_node_aff_alloc(&masks) )
>> -        return;
>> +    if ( !masks )
>> +    {
>> +        if ( !update_node_aff_alloc(&local_masks) )
>> +            return;
>> +        masks = &local_masks;
>> +    }
>>   
>>       rcu_read_lock(&domlist_read_lock);
>>   
>>       for_each_domain_in_cpupool(d, c)
>> -        domain_update_node_aff(d, &masks);
>> +        domain_update_node_aff(d, masks);
>>   
>>       rcu_read_unlock(&domlist_read_lock);
>>   
>> -    update_node_aff_free(&masks);
>> +    if ( masks == &local_masks )
>> +        update_node_aff_free(masks);
>>   }
>>   
>>   /*
> 
> Why do we need this at all?  domain_update_node_aff() already knows what
> to do when passed NULL, so this seems like an awfully complicated no-op.

You do realize that update_node_aff_free() will do something in case masks
was initially NULL?

> 
>> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>>   {
>>       unsigned int cpu = (unsigned long)hcpu;
>>       int rc = 0;
>> +    static struct cpu_rm_data *mem;
>>   
>>       switch ( action )
>>       {
>>       case CPU_DOWN_FAILED:
>> +        if ( system_state <= SYS_STATE_active )
>> +        {
>> +            if ( mem )
>> +            {
> 
> So, this does compile (and indeed I've tested the result), but I can't
> see how it should.
> 
> mem is guaranteed to be uninitialised at this point, and ...

... it is defined as "static", so it is clearly NULL initially.
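
As a stand-alone reminder of the language rule in play (illustration only, not Xen code): objects with static storage duration are zero-initialized before any code runs, with no explicit initializer needed.

#include <assert.h>

struct cpu_rm_data;                   /* opaque here; only the pointer matters */

static struct cpu_rm_data *mem;       /* static storage => starts out NULL */

int main(void)
{
    assert(mem == NULL);              /* guaranteed by C, not stack contents */
    return 0;
}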

> 
>> +                schedule_cpu_rm_free(mem, cpu);
>> +                mem = NULL;
>> +            }
>> +            rc = cpupool_cpu_add(cpu);
>> +        }
>> +        break;
>>       case CPU_ONLINE:
>>           if ( system_state <= SYS_STATE_active )
>>               rc = cpupool_cpu_add(cpu);
>> @@ -1019,12 +1038,31 @@ static int cf_check cpu_callback(
>>       case CPU_DOWN_PREPARE:
>>           /* Suspend/Resume don't change assignments of cpus to cpupools. */
>>           if ( system_state <= SYS_STATE_active )
>> +        {
>>               rc = cpupool_cpu_remove_prologue(cpu);
>> +            if ( !rc )
>> +            {
>> +                ASSERT(!mem);
> 
> ... here, and each subsequent assertion too.
> 
> Given that I tested the patch and it does fix the IRQ assertion, I can
> only imagine that it works by deterministically finding stack rubble
> which happens to be 0.

Not really, as mem isn't on the stack. :-)


Juergen
Andrew Cooper Sept. 1, 2022, 12:01 p.m. UTC | #3
On 01/09/2022 07:11, Juergen Gross wrote:
> On 01.09.22 00:52, Andrew Cooper wrote:
>> On 16/08/2022 11:13, Juergen Gross wrote:
>>> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
>>
>> Cpu cpu.
>>
>>> with interrupts disabled, thus any memory allocation or freeing must
>>> be avoided.
>>>
>>> Since commit 5047cd1d5dea ("xen/common: Use enhanced
>>> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
>>> via an assertion, which will now fail.
>>>
>>> Before that commit cpu unplugging in normal configurations was working
>>> just by chance as only the cpu performing schedule_cpu_rm() was doing
>>> active work. With core scheduling enabled, however, failures could
>>> result from memory allocations not being properly propagated to other
>>> cpus' TLBs.
>>
>> This isn't accurate, is it?  The problem with initiating a TLB flush
>> with IRQs disabled is that you can deadlock against a remote CPU which
>> is waiting for you to enable IRQs first to take a TLB flush IPI.
>
> As long as only one cpu is trying to allocate/free memory during the
> stop_machine_run() action the deadlock won't happen.
>
>> How does a memory allocation out of the xenheap result in a TLB flush?
>> Even with split heaps, you're only potentially allocating into a new
>> slot which was unused...
>
> Yeah, you are right. The main problem would occur only when a virtual
> address is changed to point at another physical address, which should be
> quite unlikely.
>
> I can drop that paragraph, as it doesn't really help.

Yeah, I think that would be best.

>>
>>> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
>>> index 58e082eb4c..2506861e4f 100644
>>> --- a/xen/common/sched/cpupool.c
>>> +++ b/xen/common/sched/cpupool.c
>>> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d,
>>> struct cpupool *c)
>>>   }
>>>     /* Update affinities of all domains in a cpupool. */
>>> -static void cpupool_update_node_affinity(const struct cpupool *c)
>>> +static void cpupool_update_node_affinity(const struct cpupool *c,
>>> +                                         struct affinity_masks *masks)
>>>   {
>>> -    struct affinity_masks masks;
>>> +    struct affinity_masks local_masks;
>>>       struct domain *d;
>>>   -    if ( !update_node_aff_alloc(&masks) )
>>> -        return;
>>> +    if ( !masks )
>>> +    {
>>> +        if ( !update_node_aff_alloc(&local_masks) )
>>> +            return;
>>> +        masks = &local_masks;
>>> +    }
>>>         rcu_read_lock(&domlist_read_lock);
>>>         for_each_domain_in_cpupool(d, c)
>>> -        domain_update_node_aff(d, &masks);
>>> +        domain_update_node_aff(d, masks);
>>>         rcu_read_unlock(&domlist_read_lock);
>>>   -    update_node_aff_free(&masks);
>>> +    if ( masks == &local_masks )
>>> +        update_node_aff_free(masks);
>>>   }
>>>     /*
>>
>> Why do we need this at all?  domain_update_node_aff() already knows what
>> to do when passed NULL, so this seems like an awfully complicated no-op.
>
> You do realize that update_node_aff_free() will do something in case
> masks
> was initially NULL?

By "this", I meant the entire hunk, not just the final if().

What is wrong with passing the (possibly NULL) masks pointer straight
down into domain_update_node_aff() and removing all the memory
allocation in this function?

But I've also answered that by looking more clearly.  It's about not
repeating the memory allocations/freeing for each domain in the pool. 
Which does make sense as this is a slow path already, but the complexity
here of having conditionally allocated masks is far from simple.
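
For readers outside the Xen tree, a stand-alone toy version of the pattern being weighed here, with invented names (not Xen code): the scratch storage is either handed in by the caller or allocated once locally, so the per-domain loop never allocates on each iteration.

#include <stdbool.h>
#include <stdlib.h>

struct scratch {
    int *buf;
};

static bool scratch_alloc(struct scratch *s)
{
    s->buf = malloc(64 * sizeof(*s->buf));
    return s->buf != NULL;
}

static void scratch_free(struct scratch *s)
{
    free(s->buf);
    s->buf = NULL;
}

/* Reuse caller-provided scratch if given, else allocate once for the loop. */
static void process_all(unsigned int n, struct scratch *scratch)
{
    struct scratch local;

    if ( !scratch )
    {
        if ( !scratch_alloc(&local) )
            return;
        scratch = &local;
    }

    for ( unsigned int i = 0; i < n; i++ )
        scratch->buf[i % 64] = (int)i;  /* per-item work reuses the scratch */

    if ( scratch == &local )
        scratch_free(scratch);
}

int main(void)
{
    struct scratch pre;

    process_all(8, NULL);               /* callee allocates and frees itself */

    if ( scratch_alloc(&pre) )          /* caller pre-allocates, cf. cpu_rm_data */
    {
        process_all(8, &pre);
        scratch_free(&pre);
    }

    return 0;
}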

>
>>
>>> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>>>   {
>>>       unsigned int cpu = (unsigned long)hcpu;
>>>       int rc = 0;
>>> +    static struct cpu_rm_data *mem;
>>>         switch ( action )
>>>       {
>>>       case CPU_DOWN_FAILED:
>>> +        if ( system_state <= SYS_STATE_active )
>>> +        {
>>> +            if ( mem )
>>> +            {
>>
>> So, this does compile (and indeed I've tested the result), but I can't
>> see how it should.
>>
>> mem is guaranteed to be uninitialised at this point, and ...
>
> ... it is defined as "static", so it is clearly NULL initially.

Oh, so it is.  That is hiding particularly well in plain sight.

Can it please be this:

@@ -1014,9 +1014,10 @@ void cf_check dump_runq(unsigned char key)
 static int cf_check cpu_callback(
     struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
+    static struct cpu_rm_data *mem; /* Protected by cpu_add_remove_lock */
+
     unsigned int cpu = (unsigned long)hcpu;
     int rc = 0;
-    static struct cpu_rm_data *mem;
 
     switch ( action )
     {

We already split static and non-static variable like this elsewhere for
clarity, and identifying the lock which protects the data is useful for
anyone coming to this in the future.

~Andrew

P.S. as an observation, this isn't safe for parallel AP booting, but I
guarantee that this isn't the only example which would want fixing if we
wanted to get parallel booting working.
Jürgen Groß Sept. 1, 2022, 12:08 p.m. UTC | #4
On 01.09.22 14:01, Andrew Cooper wrote:
> On 01/09/2022 07:11, Juergen Gross wrote:
>> On 01.09.22 00:52, Andrew Cooper wrote:
>>> On 16/08/2022 11:13, Juergen Gross wrote:
>>>> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
>>>
>>> Cpu cpu.
>>>
>>>> with interrupts disabled, thus any memory allocation or freeing must
>>>> be avoided.
>>>>
>>>> Since commit 5047cd1d5dea ("xen/common: Use enhanced
>>>> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
>>>> via an assertion, which will now fail.
>>>>
>>>> Before that commit cpu unplugging in normal configurations was working
>>>> just by chance as only the cpu performing schedule_cpu_rm() was doing
>>>> active work. With core scheduling enabled, however, failures could
>>>> result from memory allocations not being properly propagated to other
>>>> cpus' TLBs.
>>>
>>> This isn't accurate, is it?  The problem with initiating a TLB flush
>>> with IRQs disabled is that you can deadlock against a remote CPU which
>>> is waiting for you to enable IRQs first to take a TLB flush IPI.
>>
>> As long as only one cpu is trying to allocate/free memory during the
>> stop_machine_run() action the deadlock won't happen.
>>
>>> How does a memory allocation out of the xenheap result in a TLB flush?
>>> Even with split heaps, you're only potentially allocating into a new
>>> slot which was unused...
>>
>> Yeah, you are right. The main problem would occur only when a virtual
>> address is changed to point at another physical address, which should be
>> quite unlikely.
>>
>> I can drop that paragraph, as it doesn't really help.
> 
> Yeah, I think that would be best.
> 
>>>
>>>> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
>>>> index 58e082eb4c..2506861e4f 100644
>>>> --- a/xen/common/sched/cpupool.c
>>>> +++ b/xen/common/sched/cpupool.c
>>>> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d,
>>>> struct cpupool *c)
>>>>    }
>>>>      /* Update affinities of all domains in a cpupool. */
>>>> -static void cpupool_update_node_affinity(const struct cpupool *c)
>>>> +static void cpupool_update_node_affinity(const struct cpupool *c,
>>>> +                                         struct affinity_masks *masks)
>>>>    {
>>>> -    struct affinity_masks masks;
>>>> +    struct affinity_masks local_masks;
>>>>        struct domain *d;
>>>>    -    if ( !update_node_aff_alloc(&masks) )
>>>> -        return;
>>>> +    if ( !masks )
>>>> +    {
>>>> +        if ( !update_node_aff_alloc(&local_masks) )
>>>> +            return;
>>>> +        masks = &local_masks;
>>>> +    }
>>>>          rcu_read_lock(&domlist_read_lock);
>>>>          for_each_domain_in_cpupool(d, c)
>>>> -        domain_update_node_aff(d, &masks);
>>>> +        domain_update_node_aff(d, masks);
>>>>          rcu_read_unlock(&domlist_read_lock);
>>>>    -    update_node_aff_free(&masks);
>>>> +    if ( masks == &local_masks )
>>>> +        update_node_aff_free(masks);
>>>>    }
>>>>      /*
>>>
>>> Why do we need this at all?  domain_update_node_aff() already knows what
>>> to do when passed NULL, so this seems like an awfully complicated no-op.
>>
>> You do realize that update_node_aff_free() will do something in case
>> masks
>> was initially NULL?
> 
> By "this", I meant the entire hunk, not just the final if().
> 
> What is wrong with passing the (possibly NULL) masks pointer straight
> down into domain_update_node_aff() and removing all the memory
> allocation in this function?
> 
> But I've also answered that by looking more clearly.  It's about not
> repeating the memory allocations/freeing for each domain in the pool.

Correct.

> Which does make sense as this is a slow path already, but the complexity
> here of having conditionally allocated masks is far from simple.
> 
>>
>>>
>>>> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>>>>    {
>>>>        unsigned int cpu = (unsigned long)hcpu;
>>>>        int rc = 0;
>>>> +    static struct cpu_rm_data *mem;
>>>>          switch ( action )
>>>>        {
>>>>        case CPU_DOWN_FAILED:
>>>> +        if ( system_state <= SYS_STATE_active )
>>>> +        {
>>>> +            if ( mem )
>>>> +            {
>>>
>>> So, this does compile (and indeed I've tested the result), but I can't
>>> see how it should.
>>>
>>> mem is guaranteed to be uninitialised at this point, and ...
>>
>> ... it is defined as "static", so it is clearly NULL initially.
> 
> Oh, so it is.  That is hiding particularly well in plain sight.
> 
> Can it please be this:
> 
> @@ -1014,9 +1014,10 @@ void cf_check dump_runq(unsigned char key)
>   static int cf_check cpu_callback(
>       struct notifier_block *nfb, unsigned long action, void *hcpu)
>   {
> +    static struct cpu_rm_data *mem; /* Protected by cpu_add_remove_lock */
> +
>       unsigned int cpu = (unsigned long)hcpu;
>       int rc = 0;
> -    static struct cpu_rm_data *mem;
>   
>       switch ( action )
>       {
> 
> We already split static and non-static variable like this elsewhere for
> clarity, and identifying the lock which protects the data is useful for
> anyone coming to this in the future.

Fine with me.

> 
> ~Andrew
> 
> P.S. as an observation, this isn't safe for parallel AP booting, but I
> guarantee that this isn't the only example which would want fixing if we
> wanted to get parallel booting working.

You are aware that mem is used only when removing cpus?


Juergen

Patch

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 228470ac41..ffb2d6202b 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3247,7 +3247,7 @@  out:
  * by schedule_cpu_rm_alloc() is modified only in case the cpu in question is
  * being moved from or to a cpupool.
  */
-static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
+struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu, bool aff_alloc)
 {
     struct cpu_rm_data *data;
     const struct sched_resource *sr;
@@ -3260,6 +3260,17 @@  static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
     if ( !data )
         goto out;
 
+    if ( aff_alloc )
+    {
+        if ( !update_node_aff_alloc(&data->affinity) )
+        {
+            XFREE(data);
+            goto out;
+        }
+    }
+    else
+        memset(&data->affinity, 0, sizeof(data->affinity));
+
     data->old_ops = sr->scheduler;
     data->vpriv_old = idle_vcpu[cpu]->sched_unit->priv;
     data->ppriv_old = sr->sched_priv;
@@ -3280,6 +3291,7 @@  static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
         {
             while ( idx > 0 )
                 sched_res_free(&data->sr[--idx]->rcu);
+            update_node_aff_free(&data->affinity);
             XFREE(data);
             goto out;
         }
@@ -3298,10 +3310,11 @@  static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
     return data;
 }
 
-static void schedule_cpu_rm_free(struct cpu_rm_data *mem, unsigned int cpu)
+void schedule_cpu_rm_free(struct cpu_rm_data *mem, unsigned int cpu)
 {
     sched_free_udata(mem->old_ops, mem->vpriv_old);
     sched_free_pdata(mem->old_ops, mem->ppriv_old, cpu);
+    update_node_aff_free(&mem->affinity);
 
     xfree(mem);
 }
@@ -3312,17 +3325,18 @@  static void schedule_cpu_rm_free(struct cpu_rm_data *mem, unsigned int cpu)
  * The cpu is already marked as "free" and not valid any longer for its
  * cpupool.
  */
-int schedule_cpu_rm(unsigned int cpu)
+int schedule_cpu_rm(unsigned int cpu, struct cpu_rm_data *data)
 {
     struct sched_resource *sr;
-    struct cpu_rm_data *data;
     struct sched_unit *unit;
     spinlock_t *old_lock;
     unsigned long flags;
     int idx = 0;
     unsigned int cpu_iter;
+    bool freemem = !data;
 
-    data = schedule_cpu_rm_alloc(cpu);
+    if ( !data )
+        data = schedule_cpu_rm_alloc(cpu, false);
     if ( !data )
         return -ENOMEM;
 
@@ -3390,7 +3404,8 @@  int schedule_cpu_rm(unsigned int cpu)
     sched_deinit_pdata(data->old_ops, data->ppriv_old, cpu);
 
     rcu_read_unlock(&sched_res_rculock);
-    schedule_cpu_rm_free(data, cpu);
+    if ( freemem )
+        schedule_cpu_rm_free(data, cpu);
 
     return 0;
 }
diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
index 58e082eb4c..2506861e4f 100644
--- a/xen/common/sched/cpupool.c
+++ b/xen/common/sched/cpupool.c
@@ -411,22 +411,28 @@  int cpupool_move_domain(struct domain *d, struct cpupool *c)
 }
 
 /* Update affinities of all domains in a cpupool. */
-static void cpupool_update_node_affinity(const struct cpupool *c)
+static void cpupool_update_node_affinity(const struct cpupool *c,
+                                         struct affinity_masks *masks)
 {
-    struct affinity_masks masks;
+    struct affinity_masks local_masks;
     struct domain *d;
 
-    if ( !update_node_aff_alloc(&masks) )
-        return;
+    if ( !masks )
+    {
+        if ( !update_node_aff_alloc(&local_masks) )
+            return;
+        masks = &local_masks;
+    }
 
     rcu_read_lock(&domlist_read_lock);
 
     for_each_domain_in_cpupool(d, c)
-        domain_update_node_aff(d, &masks);
+        domain_update_node_aff(d, masks);
 
     rcu_read_unlock(&domlist_read_lock);
 
-    update_node_aff_free(&masks);
+    if ( masks == &local_masks )
+        update_node_aff_free(masks);
 }
 
 /*
@@ -460,15 +466,17 @@  static int cpupool_assign_cpu_locked(struct cpupool *c, unsigned int cpu)
 
     rcu_read_unlock(&sched_res_rculock);
 
-    cpupool_update_node_affinity(c);
+    cpupool_update_node_affinity(c, NULL);
 
     return 0;
 }
 
-static int cpupool_unassign_cpu_finish(struct cpupool *c)
+static int cpupool_unassign_cpu_finish(struct cpupool *c,
+                                       struct cpu_rm_data *mem)
 {
     int cpu = cpupool_moving_cpu;
     const cpumask_t *cpus;
+    struct affinity_masks *masks = mem ? &mem->affinity : NULL;
     int ret;
 
     if ( c != cpupool_cpu_moving )
@@ -491,7 +499,7 @@  static int cpupool_unassign_cpu_finish(struct cpupool *c)
      */
     if ( !ret )
     {
-        ret = schedule_cpu_rm(cpu);
+        ret = schedule_cpu_rm(cpu, mem);
         if ( ret )
             cpumask_andnot(&cpupool_free_cpus, &cpupool_free_cpus, cpus);
         else
@@ -503,7 +511,7 @@  static int cpupool_unassign_cpu_finish(struct cpupool *c)
     }
     rcu_read_unlock(&sched_res_rculock);
 
-    cpupool_update_node_affinity(c);
+    cpupool_update_node_affinity(c, masks);
 
     return ret;
 }
@@ -567,7 +575,7 @@  static long cf_check cpupool_unassign_cpu_helper(void *info)
                       cpupool_cpu_moving->cpupool_id, cpupool_moving_cpu);
     spin_lock(&cpupool_lock);
 
-    ret = cpupool_unassign_cpu_finish(c);
+    ret = cpupool_unassign_cpu_finish(c, NULL);
 
     spin_unlock(&cpupool_lock);
     debugtrace_printk("cpupool_unassign_cpu ret=%ld\n", ret);
@@ -714,7 +722,7 @@  static int cpupool_cpu_add(unsigned int cpu)
  * This function is called in stop_machine context, so we can be sure no
  * non-idle vcpu is active on the system.
  */
-static void cpupool_cpu_remove(unsigned int cpu)
+static void cpupool_cpu_remove(unsigned int cpu, struct cpu_rm_data *mem)
 {
     int ret;
 
@@ -722,7 +730,7 @@  static void cpupool_cpu_remove(unsigned int cpu)
 
     if ( !cpumask_test_cpu(cpu, &cpupool_free_cpus) )
     {
-        ret = cpupool_unassign_cpu_finish(cpupool0);
+        ret = cpupool_unassign_cpu_finish(cpupool0, mem);
         BUG_ON(ret);
     }
     cpumask_clear_cpu(cpu, &cpupool_free_cpus);
@@ -788,7 +796,7 @@  static void cpupool_cpu_remove_forced(unsigned int cpu)
         {
             ret = cpupool_unassign_cpu_start(c, master_cpu);
             BUG_ON(ret);
-            ret = cpupool_unassign_cpu_finish(c);
+            ret = cpupool_unassign_cpu_finish(c, NULL);
             BUG_ON(ret);
         }
     }
@@ -1008,10 +1016,21 @@  static int cf_check cpu_callback(
 {
     unsigned int cpu = (unsigned long)hcpu;
     int rc = 0;
+    static struct cpu_rm_data *mem;
 
     switch ( action )
     {
     case CPU_DOWN_FAILED:
+        if ( system_state <= SYS_STATE_active )
+        {
+            if ( mem )
+            {
+                schedule_cpu_rm_free(mem, cpu);
+                mem = NULL;
+            }
+            rc = cpupool_cpu_add(cpu);
+        }
+        break;
     case CPU_ONLINE:
         if ( system_state <= SYS_STATE_active )
             rc = cpupool_cpu_add(cpu);
@@ -1019,12 +1038,31 @@  static int cf_check cpu_callback(
     case CPU_DOWN_PREPARE:
         /* Suspend/Resume don't change assignments of cpus to cpupools. */
         if ( system_state <= SYS_STATE_active )
+        {
             rc = cpupool_cpu_remove_prologue(cpu);
+            if ( !rc )
+            {
+                ASSERT(!mem);
+                mem = schedule_cpu_rm_alloc(cpu, true);
+                rc = mem ? 0 : -ENOMEM;
+            }
+        }
         break;
     case CPU_DYING:
         /* Suspend/Resume don't change assignments of cpus to cpupools. */
         if ( system_state <= SYS_STATE_active )
-            cpupool_cpu_remove(cpu);
+        {
+            ASSERT(mem);
+            cpupool_cpu_remove(cpu, mem);
+        }
+        break;
+    case CPU_DEAD:
+        if ( system_state <= SYS_STATE_active )
+        {
+            ASSERT(mem);
+            schedule_cpu_rm_free(mem, cpu);
+            mem = NULL;
+        }
         break;
     case CPU_RESUME_FAILED:
         cpupool_cpu_remove_forced(cpu);
diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
index 601d639699..cc7a6cb571 100644
--- a/xen/common/sched/private.h
+++ b/xen/common/sched/private.h
@@ -603,6 +603,7 @@  void update_node_aff_free(struct affinity_masks *affinity);
 
 /* Memory allocation related data for schedule_cpu_rm(). */
 struct cpu_rm_data {
+    struct affinity_masks affinity;
     const struct scheduler *old_ops;
     void *ppriv_old;
     void *vpriv_old;
@@ -617,7 +618,9 @@  struct scheduler *scheduler_alloc(unsigned int sched_id);
 void scheduler_free(struct scheduler *sched);
 int cpu_disable_scheduler(unsigned int cpu);
 int schedule_cpu_add(unsigned int cpu, struct cpupool *c);
-int schedule_cpu_rm(unsigned int cpu);
+struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu, bool aff_alloc);
+void schedule_cpu_rm_free(struct cpu_rm_data *mem, unsigned int cpu);
+int schedule_cpu_rm(unsigned int cpu, struct cpu_rm_data *mem);
 int sched_move_domain(struct domain *d, struct cpupool *c);
 struct cpupool *cpupool_get_by_id(unsigned int poolid);
 void cpupool_put(struct cpupool *pool);
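
Taken together, the changed prototypes allow two calling conventions; a condensed sketch, not a literal excerpt from the patch:

/* 1) Simple path: schedule_cpu_rm() allocates and frees internally. */
rc = schedule_cpu_rm(cpu, NULL);

/* 2) Hotplug path: pre-allocate while IRQs are still enabled, hand the
 *    memory in for use inside stop_machine_run(), free it afterwards. */
struct cpu_rm_data *mem = schedule_cpu_rm_alloc(cpu, true);

if ( mem )
{
    rc = schedule_cpu_rm(cpu, mem);    /* performs no allocation or freeing */
    schedule_cpu_rm_free(mem, cpu);    /* caller keeps ownership */
}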