ARM: kexec: offline non panic CPUs on Kdump panic

Message ID	1374817287-27952-1-git-send-email-vijay.kilari@gmail.com (mailing list archive)
State	New, archived
From: vijay.kilari@gmail.com
To: kexec@lists.infradead.org
Cc: Prasun.Kapoor@caviumnetworks.com, linux@arm.linux.org.uk, Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>, will.deacon@arm.com, linux-arm-kernel@lists.infradead.org
Subject: [PATCH] ARM: kexec: offline non panic CPUs on Kdump panic
Date: Fri, 26 Jul 2013 11:11:27 +0530

Vijay Kilari July 26, 2013, 5:41 a.m. UTC

From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>

In the normal kexec load case, all CPUs are offlined before
machine_kexec() is called from kernel_kexec(). In the crash (panic)
case, however, the non-panic CPUs are only parked in a cpu_relax()
loop by the machine_crash_nonpanic_core() SMP function; they are
never marked offline.

When a crash kernel has been loaded with kexec and a panic is
triggered, machine_kexec() checks the number of CPUs online.
If more than one CPU is online, machine_kexec() fails to boot the
crash kernel and prints the error below:

kexec: error: multiple CPUs still online

Fix this by marking each non-panic CPU offline in the
machine_crash_nonpanic_core() SMP function before it enters the
cpu_relax() loop.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
---
 arch/arm/kernel/machine_kexec.c |    1 +
 1 file changed, 1 insertion(+)

Will Deacon July 26, 2013, 10:49 a.m. UTC | #1

[Adding Stephen Warren since he has been working in this area]

On Fri, Jul 26, 2013 at 06:41:27AM +0100, vijay.kilari@gmail.com wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
> 
> In case of normal kexec kernel load, all cpu's are offlined
> before calling machine_kexec() under kernel_kexec() function.
> But in case crash panic cpus are relaxed in
> machine_crash_nonpanic_core() SMP function but not offlined.
> 
> When crash kernel is loaded with kexec and on panic trigger
> machine_kexec() checks for number of cpus online.
> If more than one cpu is online machine_kexec() fails to load
> with below error
> 
> kexec: error: multiple CPUs still online
> 
> In machine_crash_nonpanic_core() SMP function, offline CPU
> before cpu_relax
> 
> Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
> ---
>  arch/arm/kernel/machine_kexec.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index 4fb074c..163b160 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>  	crash_save_cpu(&regs, smp_processor_id());
>  	flush_cache_all();
>  
> +	set_cpu_online(smp_processor_id(), false);
>  	atomic_dec(&waiting_for_crash_ipi);
>  	while (1)
>  		cpu_relax();

Ok, I guess this will work since the new kernel is loaded somewhere higher
in memory and the crashed kernel will stick around, so the non-crashing CPUs
can sit around spinning.

Will

Stephen Warren July 26, 2013, 5:05 p.m. UTC | #2

On 07/25/2013 11:41 PM, vijay.kilari@gmail.com wrote:
> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
> 
> In case of normal kexec kernel load, all cpu's are offlined
> before calling machine_kexec() under kernel_kexec() function.

I'm not sure that's true, unless perhaps you have CONFIG_KEXEC_JUMP enabled?

> But in case crash panic cpus are relaxed in
> machine_crash_nonpanic_core() SMP function but not offlined.
> 
> When crash kernel is loaded with kexec and on panic trigger
> machine_kexec() checks for number of cpus online.
> If more than one cpu is online machine_kexec() fails to load
> with below error
> 
> kexec: error: multiple CPUs still online
> 
> In machine_crash_nonpanic_core() SMP function, offline CPU
> before cpu_relax

> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c

> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>  	crash_save_cpu(&regs, smp_processor_id());
>  	flush_cache_all();
>  
> +	set_cpu_online(smp_processor_id(), false);

I'm not familiar with that API, but it looks like it's just setting the
*current* CPU offline. That sounds problematic for two reasons:

1) Setting the current CPU offline sounds like a bad idea; after all,
code is still running on it. Presumably you want to offline all other CPUs.

2) On a dual-CPU system, I guess this will leave a single CPU marked
online, and hence satisfy the test in machine_kexec(). However, on a
quad-core system, won't this just reduce the online CPU count from 4 to
3 and hence the test in machine_kexec() will still fail?

Can't you call disable_nonboot_cpus() from machine_crash_nonpanic_core()
just like machine_shutdown() does?

Stephen Warren July 26, 2013, 5:08 p.m. UTC | #3

On 07/26/2013 04:49 AM, Will Deacon wrote:
> [Adding Stephen Warren since he has been working in this area]
> 
> On Fri, Jul 26, 2013 at 06:41:27AM +0100, vijay.kilari@gmail.com wrote:
>> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
>>
>> In case of normal kexec kernel load, all cpu's are offlined
>> before calling machine_kexec() under kernel_kexec() function.
>> But in case crash panic cpus are relaxed in
>> machine_crash_nonpanic_core() SMP function but not offlined.
>>
>> When crash kernel is loaded with kexec and on panic trigger
>> machine_kexec() checks for number of cpus online.
>> If more than one cpu is online machine_kexec() fails to load
>> with below error
>>
>> kexec: error: multiple CPUs still online
>>
>> In machine_crash_nonpanic_core() SMP function, offline CPU
>> before cpu_relax

>> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c

>> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>>  	crash_save_cpu(&regs, smp_processor_id());
>>  	flush_cache_all();
>>  
>> +	set_cpu_online(smp_processor_id(), false);
>>  	atomic_dec(&waiting_for_crash_ipi);
>>  	while (1)
>>  		cpu_relax();
> 
> Ok, I guess this will work since the new kernel is loaded somewhere higher
> in memory and the crashed kernel will stick around, so the non-crashing CPUs
> can sit around spinning.

Does a kernel that's used as the crash kernel guarantee:

* Never to re-use the memory that was used by the previous kernel, so
that the spin loop code/data won't be corrupted, ever, no matter how
long the crash recovery kernel runs.

* Not use SMP, so there's never a need to re-activate the non-boot CPUs,
which might not work if they aren't truly disabled but rather just
running a spin loop?

Will Deacon July 26, 2013, 5:11 p.m. UTC | #4

On Fri, Jul 26, 2013 at 06:08:07PM +0100, Stephen Warren wrote:
> On 07/26/2013 04:49 AM, Will Deacon wrote:
> > [Adding Stephen Warren since he has been working in this area]
> > 
> > On Fri, Jul 26, 2013 at 06:41:27AM +0100, vijay.kilari@gmail.com wrote:
> >> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
> >>
> >> In case of normal kexec kernel load, all cpu's are offlined
> >> before calling machine_kexec() under kernel_kexec() function.
> >> But in case crash panic cpus are relaxed in
> >> machine_crash_nonpanic_core() SMP function but not offlined.
> >>
> >> When crash kernel is loaded with kexec and on panic trigger
> >> machine_kexec() checks for number of cpus online.
> >> If more than one cpu is online machine_kexec() fails to load
> >> with below error
> >>
> >> kexec: error: multiple CPUs still online
> >>
> >> In machine_crash_nonpanic_core() SMP function, offline CPU
> >> before cpu_relax
> 
> >> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> 
> >> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
> >>  	crash_save_cpu(&regs, smp_processor_id());
> >>  	flush_cache_all();
> >>  
> >> +	set_cpu_online(smp_processor_id(), false);
> >>  	atomic_dec(&waiting_for_crash_ipi);
> >>  	while (1)
> >>  		cpu_relax();
> > 
> > Ok, I guess this will work since the new kernel is loaded somewhere higher
> > in memory and the crashed kernel will stick around, so the non-crashing CPUs
> > can sit around spinning.
> 
> Does a kernel that's used as the crash kernel guarantee:
> 
> * Never to re-use the memory that was used by the previous kernel, so
> that the spin loop code/data won't be corrupted, ever, no matter how
> long the crash recovery kernel runs.
> 
> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
> which might not work if they aren't truly disabled but rather just
> running a pin loop?

I *think* this is true, and x86 seems to have code to a similar effect (the
powerpc stuff lost me though). I've never played with crash kernels on SMP
though...

Will

Vijay Kilari July 30, 2013, 10:05 a.m. UTC | #5

On Fri, Jul 26, 2013 at 10:35 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 07/25/2013 11:41 PM, vijay.kilari@gmail.com wrote:
>> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
>>
>> In case of normal kexec kernel load, all cpu's are offlined
>> before calling machine_kexec() under kernel_kexec() function.
>
> I'm not sure that's true, unless perhaps you have CONFIG_KEXEC_JUMP enabled?
>
>> But in case crash panic cpus are relaxed in
>> machine_crash_nonpanic_core() SMP function but not offlined.
>>
>> When crash kernel is loaded with kexec and on panic trigger
>> machine_kexec() checks for number of cpus online.
>> If more than one cpu is online machine_kexec() fails to load
>> with below error
>>
>> kexec: error: multiple CPUs still online
>>
>> In machine_crash_nonpanic_core() SMP function, offline CPU
>> before cpu_relax
>
>> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
>
>> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>>       crash_save_cpu(&regs, smp_processor_id());
>>       flush_cache_all();
>>
>> +     set_cpu_online(smp_processor_id(), false);
>
> I'm not familiar with that API, but it looks like it's just setting the
> *current* CPU offline. That sounds problematic for two reasons:
>
> 1) Setting the current CPU offline sounds like a bad idea; after all,
> code is still running on it. Presumably you want to offline all other CPUs.
>
   machine_crash_nonpanic_core() is run as an SMP call (smp_call_function()),
   so the CPU is set offline on all other CPUs, not on the caller.

> 2) On a dual-CPU system, I guess this will leave a single CPU marked
> online, and hence satisfy the test in machine_kexec(). However, on a
> quad-core system, won't this just reduce the online CPU count from 4 to
> 3 and hence the test in machine_kexec() will still fail?
>
   The CPU is set offline from the SMP call function, so this is done on
   all the CPUs in the system except the calling CPU.

> Can't you call disable_nonboot_cpus() from machine_crash_nonpanic_core()
> just like machine_shutdown() does?
   I thought of using disable_nonboot_cpus(). However, a crash can happen on
   any CPU, so we have to stop only the non-panic CPUs.
   The other mechanisms I considered for offlining CPUs are:
    1) Calling __cpu_disable() to put the CPU completely offline. However,
        platform_cpu_disable() does not allow CPU 0 to be disabled (and a
        crash can happen on any core).
    2) Calling machine_halt(). This does not allow smp_send_stop() on the
        bootable CPU.
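
To illustrate the point about core counts, here is a minimal userspace sketch of the bookkeeping, not kernel code. It assumes a single bitmask as a stand-in for the kernel's online-CPU mask, and the model_* names and NR_CPUS value are hypothetical, chosen only for this illustration. It models the handler running on every CPU except the panicking one, with each of those CPUs clearing its own bit.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* illustrative: a quad-core system */

/* Userspace stand-in for the kernel's online-CPU mask: one bit per CPU. */
static unsigned int online_mask;

/* Models set_cpu_online(): set or clear one CPU's bit in the mask. */
static void model_set_cpu_online(unsigned int cpu, bool online)
{
	assert(cpu < NR_CPUS);
	if (online)
		online_mask |= 1u << cpu;
	else
		online_mask &= ~(1u << cpu);
}

/* Models num_online_cpus(): count the bits set in the mask. */
static unsigned int model_num_online_cpus(void)
{
	unsigned int cpu, count = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (online_mask & (1u << cpu))
			count++;
	return count;
}

/*
 * Models the crash path: the handler runs on every CPU except the
 * panicking one, and each of those CPUs marks itself offline.
 * Returns the number of CPUs still marked online afterwards.
 */
static unsigned int model_online_after_panic(unsigned int panic_cpu)
{
	unsigned int cpu;

	assert(panic_cpu < NR_CPUS);
	online_mask = (1u << NR_CPUS) - 1; /* all CPUs start online */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != panic_cpu)
			model_set_cpu_online(cpu, false);
	return model_num_online_cpus();
}
```

In this model, whichever CPU panics, only that CPU remains marked online, so a "more than one CPU online" test would pass on a quad-core system as well as on a dual-core one. The sketch says nothing about whether the other CPUs are really stopped; they are still spinning.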

Vijay Kilari July 30, 2013, 10:37 a.m. UTC | #6

On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 07/26/2013 04:49 AM, Will Deacon wrote:
>> [Adding Stephen Warren since he has been working in this area]
>>
>> On Fri, Jul 26, 2013 at 06:41:27AM +0100, vijay.kilari@gmail.com wrote:
>>> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
>>>
>>> In case of normal kexec kernel load, all cpu's are offlined
>>> before calling machine_kexec() under kernel_kexec() function.
>>> But in case crash panic cpus are relaxed in
>>> machine_crash_nonpanic_core() SMP function but not offlined.
>>>
>>> When crash kernel is loaded with kexec and on panic trigger
>>> machine_kexec() checks for number of cpus online.
>>> If more than one cpu is online machine_kexec() fails to load
>>> with below error
>>>
>>> kexec: error: multiple CPUs still online
>>>
>>> In machine_crash_nonpanic_core() SMP function, offline CPU
>>> before cpu_relax
>
>>> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
>
>>> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>>>      crash_save_cpu(&regs, smp_processor_id());
>>>      flush_cache_all();
>>>
>>> +    set_cpu_online(smp_processor_id(), false);
>>>      atomic_dec(&waiting_for_crash_ipi);
>>>      while (1)
>>>              cpu_relax();
>>
>> Ok, I guess this will work since the new kernel is loaded somewhere higher
>> in memory and the crashed kernel will stick around, so the non-crashing CPUs
>> can sit around spinning.
>
> Does a kernel that's used as the crash kernel guarantee:
>
> * Never to re-use the memory that was used by the previous kernel, so
> that the spin loop code/data won't be corrupted, ever, no matter how
> long the crash recovery kernel runs.
>
> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
> which might not work if they aren't truly disabled but rather just
> running a pin loop?

According to /proc/iomem, the normal kernel executes from 0x80xxxxxx, with
64M reserved for the crash kernel at 0xa0000000 (64M@0xa0000000):

80000000-bfffffff : System RAM
  80008000-805aeddf : Kernel code
  805e2000-8063e427 : Kernel data
  a0000000-a3ffffff : Crash kernel

The crash kernel is loaded into the reserved memory region and executes from
there. I confirmed this from /proc/iomem while the crash kernel was running:

a0000000-a3efffff : System RAM
  a0008000-a05aeddf : Kernel code
  a05e2000-a063e427 : Kernel data
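
The address arithmetic in the listings above can be checked with a small userspace sketch. The struct and function names are hypothetical; only the addresses come from the /proc/iomem output. It verifies that the original kernel image and the crash kernel reservation do not overlap, and that the reservation is 64M.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An inclusive physical address range, as printed by /proc/iomem. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/* Size in bytes of an inclusive range. */
static uint64_t range_size(struct mem_range r)
{
	assert(r.start <= r.end);
	return r.end - r.start + 1;
}

/* True if two inclusive ranges share at least one address. */
static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* Values copied from the /proc/iomem listings above. */
static const struct mem_range orig_kernel_image = { 0x80008000u, 0x8063e427u };
static const struct mem_range crash_reservation = { 0xa0000000u, 0xa3ffffffu };
static const struct mem_range crash_system_ram  = { 0xa0000000u, 0xa3efffffu };
```

With these values, ranges_overlap(orig_kernel_image, crash_reservation) is false and range_size(crash_reservation) is 64 * 1024 * 1024. This only covers the static kernel image (code and data) of the original kernel; it does not show anything about memory the original kernel allocated dynamically.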

Stephen Warren July 30, 2013, 4:57 p.m. UTC | #7

On 07/30/2013 04:05 AM, Vijay Kilari wrote:
> On Fri, Jul 26, 2013 at 10:35 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>> On 07/25/2013 11:41 PM, vijay.kilari@gmail.com wrote:
>>> From: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
>>>
>>> In case of normal kexec kernel load, all cpu's are offlined
>>> before calling machine_kexec() under kernel_kexec() function.
>>
>> I'm not sure that's true, unless perhaps you have CONFIG_KEXEC_JUMP enabled?
>>
>>> But in case crash panic cpus are relaxed in
>>> machine_crash_nonpanic_core() SMP function but not offlined.
>>>
>>> When crash kernel is loaded with kexec and on panic trigger
>>> machine_kexec() checks for number of cpus online.
>>> If more than one cpu is online machine_kexec() fails to load
>>> with below error
>>>
>>> kexec: error: multiple CPUs still online
>>>
>>> In machine_crash_nonpanic_core() SMP function, offline CPU
>>> before cpu_relax
>>
>>> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
>>
>>> @@ -73,6 +73,7 @@ void machine_crash_nonpanic_core(void *unused)
>>>       crash_save_cpu(&regs, smp_processor_id());
>>>       flush_cache_all();
>>>
>>> +     set_cpu_online(smp_processor_id(), false);
>>
>> I'm not familiar with that API, but it looks like it's just setting the
>> *current* CPU offline. That sounds problematic for two reasons:
>>
>> 1) Setting the current CPU offline sounds like a bad idea; after all,
>> code is still running on it. Presumably you want to offline all other CPUs.
>>
>    machine_crash_nonpanic_core() is a SMP call (smp_call_function) .
>    Setting cpu offline is called for all other CPUs except the caller.

Ah OK, that's what I was missing. This makes sense then.

Stephen Warren July 30, 2013, 4:59 p.m. UTC | #8

On 07/30/2013 04:37 AM, Vijay Kilari wrote:
> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
...
>> Does a kernel that's used as the crash kernel guarantee:
>>
>> * Never to re-use the memory that was used by the previous kernel, so
>> that the spin loop code/data won't be corrupted, ever, no matter how
>> long the crash recovery kernel runs.
>>
>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>> which might not work if they aren't truly disabled but rather just
>> running a pin loop?
> 
> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
> kernel reserved 64M@0xa0000000
> 
> 80000000-bfffffff : System RAM
>   80008000-805aeddf : Kernel code
>   805e2000-8063e427 : Kernel data
>   a0000000-a3ffffff : Crash kernel
> 
> crash kernel is loaded to reserved memory location and is executed from there.
> I could confirm this from /proc/iomem when crash kernel is running
> 
> a0000000-a3efffff : System RAM
>   a0008000-a05aeddf : Kernel code
>   a05e2000-a063e427 : Kernel data

OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
well, which would guarantee that the spin loop being executed by the
non-crash CPUs won't be corrupted?

Vijay Kilari July 31, 2013, 11:37 a.m. UTC | #9

On Tue, Jul 30, 2013 at 10:29 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 07/30/2013 04:37 AM, Vijay Kilari wrote:
>> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> ...
>>> Does a kernel that's used as the crash kernel guarantee:
>>>
>>> * Never to re-use the memory that was used by the previous kernel, so
>>> that the spin loop code/data won't be corrupted, ever, no matter how
>>> long the crash recovery kernel runs.
>>>
>>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>>> which might not work if they aren't truly disabled but rather just
>>> running a pin loop?
>>
>> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
>> kernel reserved 64M@0xa0000000
>>
>> 80000000-bfffffff : System RAM
>>   80008000-805aeddf : Kernel code
>>   805e2000-8063e427 : Kernel data
>>   a0000000-a3ffffff : Crash kernel
>>
>> crash kernel is loaded to reserved memory location and is executed from there.
>> I could confirm this from /proc/iomem when crash kernel is running
>>
>> a0000000-a3efffff : System RAM
>>   a0008000-a05aeddf : Kernel code
>>   a05e2000-a063e427 : Kernel data
>
> OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
> well, which would guarantee that the spin loop being executed by the
> non-crash CPUs won't be corrupted?

The crash dump kernel runs from the reserved memory area (0xa0000000 - 0xa3efffff),
so it should not corrupt the memory area of the original kernel that was running
at 0x80000000, where the other CPUs are in the spin loop.

Stephen Warren July 31, 2013, 5:14 p.m. UTC | #10

On 07/31/2013 05:37 AM, Vijay Kilari wrote:
> On Tue, Jul 30, 2013 at 10:29 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>> On 07/30/2013 04:37 AM, Vijay Kilari wrote:
>>> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>> ...
>>>> Does a kernel that's used as the crash kernel guarantee:
>>>>
>>>> * Never to re-use the memory that was used by the previous kernel, so
>>>> that the spin loop code/data won't be corrupted, ever, no matter how
>>>> long the crash recovery kernel runs.
>>>>
>>>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>>>> which might not work if they aren't truly disabled but rather just
>>>> running a pin loop?
>>>
>>> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
>>> kernel reserved 64M@0xa0000000
>>>
>>> 80000000-bfffffff : System RAM
>>>   80008000-805aeddf : Kernel code
>>>   805e2000-8063e427 : Kernel data
>>>   a0000000-a3ffffff : Crash kernel
>>>
>>> crash kernel is loaded to reserved memory location and is executed from there.
>>> I could confirm this from /proc/iomem when crash kernel is running
>>>
>>> a0000000-a3efffff : System RAM
>>>   a0008000-a05aeddf : Kernel code
>>>   a05e2000-a063e427 : Kernel data
>>
>> OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
>> well, which would guarantee that the spin loop being executed by the
>> non-crash CPUs won't be corrupted?
> 
> The crash dump kernel runs from reserved memory area (0xa0000000 - 0xa3effffff).
>  So it should not corrupt the memory area of original kernel that was running
>  at 0x80000000,where other CPU's are in spin loop.

What about dynamic allocations?

Vijay Kilari Aug. 1, 2013, 1:49 p.m. UTC | #11

On Wed, Jul 31, 2013 at 10:44 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 07/31/2013 05:37 AM, Vijay Kilari wrote:
>> On Tue, Jul 30, 2013 at 10:29 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>> On 07/30/2013 04:37 AM, Vijay Kilari wrote:
>>>> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>> ...
>>>>> Does a kernel that's used as the crash kernel guarantee:
>>>>>
>>>>> * Never to re-use the memory that was used by the previous kernel, so
>>>>> that the spin loop code/data won't be corrupted, ever, no matter how
>>>>> long the crash recovery kernel runs.
>>>>>
>>>>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>>>>> which might not work if they aren't truly disabled but rather just
>>>>> running a pin loop?
>>>>
>>>> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
>>>> kernel reserved 64M@0xa0000000
>>>>
>>>> 80000000-bfffffff : System RAM
>>>>   80008000-805aeddf : Kernel code
>>>>   805e2000-8063e427 : Kernel data
>>>>   a0000000-a3ffffff : Crash kernel
>>>>
>>>> crash kernel is loaded to reserved memory location and is executed from there.
>>>> I could confirm this from /proc/iomem when crash kernel is running
>>>>
>>>> a0000000-a3efffff : System RAM
>>>>   a0008000-a05aeddf : Kernel code
>>>>   a05e2000-a063e427 : Kernel data
>>>
>>> OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
>>> well, which would guarantee that the spin loop being executed by the
>>> non-crash CPUs won't be corrupted?
>>
>> The crash dump kernel runs from reserved memory area (0xa0000000 - 0xa3effffff).
>>  So it should not corrupt the memory area of original kernel that was running
>>  at 0x80000000,where other CPU's are in spin loop.
>
> What about dynamic allocations?
>
IMHO, it is the responsibility of the kdump functionality to ensure that it
does not corrupt the original kernel's dynamic allocations.

Stephen Warren Aug. 1, 2013, 4:25 p.m. UTC | #12

On 08/01/2013 07:49 AM, Vijay Kilari wrote:
> On Wed, Jul 31, 2013 at 10:44 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>> On 07/31/2013 05:37 AM, Vijay Kilari wrote:
>>> On Tue, Jul 30, 2013 at 10:29 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>>> On 07/30/2013 04:37 AM, Vijay Kilari wrote:
>>>>> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>>> ...
>>>>>> Does a kernel that's used as the crash kernel guarantee:
>>>>>>
>>>>>> * Never to re-use the memory that was used by the previous kernel, so
>>>>>> that the spin loop code/data won't be corrupted, ever, no matter how
>>>>>> long the crash recovery kernel runs.
>>>>>>
>>>>>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>>>>>> which might not work if they aren't truly disabled but rather just
>>>>>> running a pin loop?
>>>>>
>>>>> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
>>>>> kernel reserved 64M@0xa0000000
>>>>>
>>>>> 80000000-bfffffff : System RAM
>>>>>   80008000-805aeddf : Kernel code
>>>>>   805e2000-8063e427 : Kernel data
>>>>>   a0000000-a3ffffff : Crash kernel
>>>>>
>>>>> crash kernel is loaded to reserved memory location and is executed from there.
>>>>> I could confirm this from /proc/iomem when crash kernel is running
>>>>>
>>>>> a0000000-a3efffff : System RAM
>>>>>   a0008000-a05aeddf : Kernel code
>>>>>   a05e2000-a063e427 : Kernel data
>>>>
>>>> OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
>>>> well, which would guarantee that the spin loop being executed by the
>>>> non-crash CPUs won't be corrupted?
>>>
>>> The crash dump kernel runs from reserved memory area (0xa0000000 - 0xa3effffff).
>>>  So it should not corrupt the memory area of original kernel that was running
>>>  at 0x80000000,where other CPU's are in spin loop.
>>
>> What about dynamic allocations?
>
> IMHO, it is the kdump functionality to ensure that it won't corrupt
> original kernel's dynamic allocations

OK, if there are already explicit measures to ensure this, then there's
no issue.

Vijay Kilari Aug. 12, 2013, 12:18 p.m. UTC | #13

On Thu, Aug 1, 2013 at 9:55 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 08/01/2013 07:49 AM, Vijay Kilari wrote:
>> On Wed, Jul 31, 2013 at 10:44 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>> On 07/31/2013 05:37 AM, Vijay Kilari wrote:
>>>> On Tue, Jul 30, 2013 at 10:29 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>>>> On 07/30/2013 04:37 AM, Vijay Kilari wrote:
>>>>>> On Fri, Jul 26, 2013 at 10:38 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
>>>>> ...
>>>>>>> Does a kernel that's used as the crash kernel guarantee:
>>>>>>>
>>>>>>> * Never to re-use the memory that was used by the previous kernel, so
>>>>>>> that the spin loop code/data won't be corrupted, ever, no matter how
>>>>>>> long the crash recovery kernel runs.
>>>>>>>
>>>>>>> * Not use SMP, so there's never a need to re-activate the non-boot CPUs,
>>>>>>> which might not work if they aren't truly disabled but rather just
>>>>>>> running a pin loop?
>>>>>>
>>>>>> From cat /proc/iomem, normal kernel is executed from (0x80xxxxxx) with crash
>>>>>> kernel reserved 64M@0xa0000000
>>>>>>
>>>>>> 80000000-bfffffff : System RAM
>>>>>>   80008000-805aeddf : Kernel code
>>>>>>   805e2000-8063e427 : Kernel data
>>>>>>   a0000000-a3ffffff : Crash kernel
>>>>>>
>>>>>> crash kernel is loaded to reserved memory location and is executed from there.
>>>>>> I could confirm this from /proc/iomem when crash kernel is running
>>>>>>
>>>>>> a0000000-a3efffff : System RAM
>>>>>>   a0008000-a05aeddf : Kernel code
>>>>>>   a05e2000-a063e427 : Kernel data
>>>>>
>>>>> OK, but in the crash dump kernel, is 80008000..8063e427 reserved as
>>>>> well, which would guarantee that the spin loop being executed by the
>>>>> non-crash CPUs won't be corrupted?
>>>>
>>>> The crash dump kernel runs from reserved memory area (0xa0000000 - 0xa3effffff).
>>>>  So it should not corrupt the memory area of original kernel that was running
>>>>  at 0x80000000,where other CPU's are in spin loop.
>>>
>>> What about dynamic allocations?
>>
>> IMHO, it is the kdump functionality to ensure that it won't corrupt
>> original kernel's dynamic allocations
>
> OK, if there are explicit measure to assure this already, then there's
> no issue.

Hi Will,

 Can you please consider this patch?

Thanks & Regards
Vijay

Will Deacon Aug. 13, 2013, 11:18 a.m. UTC | #14

On Mon, Aug 12, 2013 at 01:18:38PM +0100, Vijay Kilari wrote:
> On Thu, Aug 1, 2013 at 9:55 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> > OK, if there are explicit measure to assure this already, then there's
> > no issue.
> 
> Hi Will,
> 
>  Can you please consider this patch?

Assuming that Stephen and I are understanding things correctly, then this
patch seems fine.

Can you put it into Russell's patch system please?

Will
