
kexec failures on ipq4019

Message ID CAAGQ2nQNQ-aFkcrQHNA6H5TZ1tTovtfO_0Ohfndn9jXy13Hc6A@mail.gmail.com
State Not Applicable, archived
Delegated to: Andy Gross

Commit Message

Andy Strohman June 16, 2018, 1:07 a.m. UTC
Hi,

  I'm trying to get kexec to work consistently for ipq4019.  I load
the crash kernel like this:

kexec --type zImage -p zImage-initramfs \
    --dtb=image-qcom-ipq4019-eap1300.dtb \
    --append="maxcpus=1 reset_devices" \
    --image-size=34419456

  I have reserved 64MB of memory for the crash kernel with parameter:
crashkernel=64M
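A quick way to confirm that the crash kernel actually landed in the reservation is to read the kexec entries the kernel exports in sysfs. The helper below is a sketch that assumes CONFIG_KEXEC_CORE is enabled, so that /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size exist; the function names are made up for illustration, and comparing against the --image-size value is only a rough sanity check, since the reservation also has to hold the initramfs and the DTB.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned decimal value from a sysfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_sysfs_ulong(const char *path, unsigned long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if a crash kernel is loaded and the reserved region is at
 * least image_size bytes, 0 if not, and -1 if either file is unreadable.
 * On a real system loaded_path is /sys/kernel/kexec_crash_loaded and
 * size_path is /sys/kernel/kexec_crash_size. */
static int crash_kernel_ready(const char *loaded_path, const char *size_path,
                              unsigned long image_size)
{
    unsigned long loaded = 0, reserved = 0;

    if (read_sysfs_ulong(loaded_path, &loaded) != 0 ||
        read_sysfs_ulong(size_path, &reserved) != 0)
        return -1;
    return (loaded == 1 && reserved >= image_size) ? 1 : 0;
}
```

With crashkernel=64M the reservation is 67108864 bytes, so after a successful kexec -p one would expect crash_kernel_ready("/sys/kernel/kexec_crash_loaded", "/sys/kernel/kexec_crash_size", 34419456UL) to return 1.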

  This seems to work ~70% of the time. When it doesn't work, I see the
"bye!" message followed by a 5-10 second hang without output.  Then
the machine resets.

  I've been testing with:
echo c > /proc/sysrq-trigger

  Does anyone have an idea of what may be causing the failures or how
to troubleshoot this?

  I'm using OpenWrt with kernel 4.14.37. I added the patch shown in the
Patch section below in order to load the crash kernel.


Thanks,

Andy

Comments

Sricharan Ramabadhran June 18, 2018, 6:31 a.m. UTC | #1
Hi Andy,

On 2018-06-16 06:37, Andy Strohman wrote:
> Hi,
> 
>   I'm trying to get kexec to work consistently for ipq4019.  I load
> the crash kernel like this:
> 
> kexec --type zImage -p zImage-initramfs
> --dtb=image-qcom-ipq4019-eap1300.dtb --append="maxcpus=1
> reset_devices" --image-size=34419456
> 
>   I have reserved 64MB of memory for the crash kernel with parameter:
> crashkernel=64M
> 
>   This seems to work ~70% of the time. When it doesn't work, I see the
> "bye!" message followed by a 5-10 second hang without output.  Then
> the machine resets.
> 
>   I've been testing with:
> echo c > /proc/sysrq-trigger
> 
>   Does anyone have an idea of what may be causing the failures or how
> to troubleshoot this?
> 

  I will try to reproduce this and get back to you shortly.

Regards,
  Sricharan

>   I'm using OpenWRT with kernel 4.14.37.   I added the following patch
> in order to load the crash kernel:
> 
> --- a/arch/arm/mach-qcom/platsmp.c
> +++ b/arch/arm/mach-qcom/platsmp.c
> @@ -332,6 +332,12 @@ static void __init qcom_smp_prepare_cpus
>   }
>  }
> 
> +/* Needed by kexec and platform_can_cpu_hotplug() */
> +int qcom_cpu_kill(unsigned int cpu)
> +{
> +    return 1;
> +}
> +
>  static const struct smp_operations smp_msm8660_ops __initconst = {
>   .smp_prepare_cpus = qcom_smp_prepare_cpus,
>   .smp_secondary_init = qcom_secondary_init,
> @@ -358,6 +364,7 @@ static const struct smp_operations qcom_
>   .smp_boot_secondary = kpssv2_boot_secondary,
>  #ifdef CONFIG_HOTPLUG_CPU
>   .cpu_die = qcom_cpu_die,
> +    .cpu_kill       = qcom_cpu_kill,
>  #endif
>  };
>  CPU_METHOD_OF_DECLARE(qcom_smp_kpssv2, "qcom,kpss-acc-v2",
> &qcom_smp_kpssv2_ops);
> 
> 
> Thanks,
> 
> Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Strohman June 26, 2018, 9:17 p.m. UTC | #2
On Sun, Jun 17, 2018 at 11:31 PM,  <sricharan@codeaurora.org> wrote:
>  I will try to reproduce this and get back to you shortly.

Hi Sricharan,

  Thanks for your response.  Did you get a chance to try this out?  If
so, were you able to reproduce?

Thanks,

Andy
Sricharan Ramabadhran June 28, 2018, 4:33 a.m. UTC | #3
Hi Andy,

On 6/27/2018 2:47 AM, Andy Strohman wrote:
> Hi Sricharan,
> 
>   Thanks for your response.  Did you get a chance to try this out?  If
> so, were you able to reproduce?
> 

   I have been trying kexec while chroot'ing, for a different reason.
   I have not observed an issue so far, but that is with a 4.4.60 OpenWrt kernel.
   Can you point me to the kernel you are testing with?

Regards,
 Sricharan
Andy Strohman June 28, 2018, 5:25 p.m. UTC | #4
On Wed, Jun 27, 2018 at 9:33 PM, Sricharan R <sricharan@codeaurora.org> wrote:
> Hi Andy,
>
>    I have been trying to kexec while chroot'ing for a different reason.
>    I did not observe a issue so far. But that is with a 4.4.60 openwrt kernel.
>    Can you point me a link to the kernel that you are trying with ?
>
> Regards,
>  Sricharan
>
> --
> "QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Sricharan,

  I am using https://git.openwrt.org/openwrt/openwrt.git, commit:
ac70ac3532fefa78c944d8a26c8df0ca5d88d04e

  Can you provide me a link to the source that you are using?

  I think my problem is that the cpu_kill callback in struct
smp_operations is not properly implemented in
arch/arm/mach-qcom/platsmp.c. I added a dummy function that just
returns 1 so that the crash kernel can be loaded; that is the patch in my
original email in this thread. I tried this approach because I saw
another SUBARCH doing the same, but I think it's inadequate.

  After reading the surrounding code,
https://patchwork.ozlabs.org/patch/207562/ and
https://patchwork.kernel.org/patch/1925071/ , I now believe that I
need to power down the CPUs in cpu_kill. Since I don't have the datasheet
for ipq4019, I'm not sure how to do that. If you have any advice
regarding this, that would be great.
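To make the concern concrete, here is a simplified user-space model of the cpu_kill hook. It assumes, as the comment in the patch says, that platform_can_cpu_hotplug() only needs a cpu_kill callback to be present; model_smp_ops, dummy_cpu_kill() and checked_cpu_kill() are hypothetical names, not kernel APIs. What it illustrates is that the dummy passes the presence check and reports success for a CPU that was never powered down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4  /* illustrative CPU count for the model */

/* Hypothetical, heavily simplified stand-in for struct smp_operations:
 * only the hook discussed in this thread is modelled. */
struct model_smp_ops {
    int (*cpu_kill)(unsigned int cpu);  /* 1 = success, as in the patch */
};

/* Whether each modelled CPU has really been powered down. */
static bool cpu_powered_off[MODEL_NR_CPUS];

/* Equivalent of the dummy in the patch: claims success, touches nothing. */
static int dummy_cpu_kill(unsigned int cpu)
{
    (void)cpu;
    return 1;
}

/* What a real implementation has to guarantee: only report success once
 * the CPU is actually off (how to power it off is the open question). */
static int checked_cpu_kill(unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return cpu_powered_off[cpu] ? 1 : 0;
}

/* Models the check behind platform_can_cpu_hotplug() as described in the
 * patch comment: loading is allowed as soon as a cpu_kill hook exists. */
static bool model_can_cpu_hotplug(const struct model_smp_ops *ops)
{
    return ops->cpu_kill != NULL;
}
```

Both ops tables pass model_can_cpu_hotplug(), but only checked_cpu_kill() ties the result to the CPU's real state, which is the gap the thread is trying to close.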

  When I boot the machine with nr_cpus=1, kexec always works. It
also seems to work reliably if I use taskset to pin the process that
triggers the crash to CPU 0.
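The taskset workaround can also be reproduced from a small test program. This sketch uses the Linux sched_setaffinity() call to restrict the caller to a single CPU (the same effect as taskset -c 0) and then writes 'c' to a sysrq trigger file; the helper names are illustrative, and the path is a parameter so the write can be exercised on an ordinary file instead of crashing the machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Restrict the calling process to a single CPU, the same effect as
 * running it under `taskset -c <cpu>`. Returns 0 on success, -1 on error
 * (for example if that CPU is not available to this process). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Write the sysrq command 'c' to the given trigger file. With
 * path = "/proc/sysrq-trigger" (as root) this crashes the kernel
 * immediately and, if a crash kernel is loaded, kexecs into it. */
static int trigger_sysrq_crash(const char *path)
{
    FILE *f = fopen(path, "w");
    int rc;

    if (!f)
        return -1;
    rc = (fputc('c', f) == 'c') ? 0 : -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

On the target, calling pin_to_cpu(0) and then trigger_sysrq_crash("/proc/sysrq-trigger") as root should behave roughly like taskset -c 0 sh -c 'echo c > /proc/sysrq-trigger'.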

Thanks,

Andy
Sricharan Ramabadhran July 2, 2018, 2:06 p.m. UTC | #5
Hi Andy,


> Sricharan,
> 
>   I am using https://git.openwrt.org/openwrt/openwrt.git, commit:
> ac70ac3532fefa78c944d8a26c8df0ca5d88d04e
> 
>   Can you provide me a link to the source that you are using?
> 
>   I think my problem is that callback cpu_kill within struct
> smp_operations is not properly implemented in
> arch/arm/mach-qcom/platsmp.c.  I added a dummy function that just
> returns 1 to allow loading the crash kernel.  That is the patch in my
> original email in this thread.  I gave this approach a try because I
> saw another SUBARCH doing the same, but I think it's inadequate.
> 
>   After reading the surrounding code,
> https://patchwork.ozlabs.org/patch/207562/ and
> https://patchwork.kernel.org/patch/1925071/ , I now believe that I
> need to power down CPUs in cpu_kill.  Since I don't have the datasheet
> for ipq4019, I'm not sure how to do that.  If you have any advise
> regarding this, that would great.
> 
>   When I boot the machine with nr_cpus=1,  kexec always works.   It
> also seems to work reliably if I taskset the process that triggers the
> crash to cpu 0.
> 

Sorry for the delayed response.

https://source.codeaurora.org/quic/qsdk/oss/kernel/linux-msm/tree/?h=eggplant

I just noticed that we always pass nr_cpus=1 to the kexec kernel, and that's
why it worked.
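Since the difference between the working and failing setups comes down to kernel command-line options (nr_cpus=1 here versus maxcpus=1 in the --append string of the original kexec command), it can help to check what the crash kernel actually booted with. The sketch below is a hypothetical helper that extracts an integer name=value option from a command-line string such as the contents of /proc/cmdline; it matches whole space-separated tokens only and does not handle quoted values.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the integer value of a "name=value" option in a kernel command
 * line (for example the contents of /proc/cmdline), or -1 if the option
 * is absent. Only whole space-separated tokens are matched, so looking
 * for "nr_cpus" does not match "xnr_cpus=4"; quoting is not handled. */
static long cmdline_int_param(const char *cmdline, const char *name)
{
    size_t len = strlen(name);
    const char *p = cmdline;

    assert(len > 0);
    while (*p) {
        while (*p == ' ')
            p++;
        if (strncmp(p, name, len) == 0 && p[len] == '=')
            return strtol(p + len + 1, NULL, 10);
        while (*p && *p != ' ')
            p++;
    }
    return -1;
}
```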

That said, the current mach-qcom/platsmp.c code does not implement the
cpu_kill callback. Even for hotplug, it just does a wfi(). While a wfi()
works for hotplug, it does not work for kexec, since the CPUs were never
put into reset state and waking them would simply fail.

That means we need to implement the cpu_kill callback as the complement of
kpssv2_release_secondary. I will try to write down the exact sequence from
the programming guide and post it here.
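The reasoning above can be captured as a small state model. This is a hypothetical sketch, not derived from any ipq4019 documentation: a core that only executed WFI can be woken again by the running kernel, but the boot-time release sequence (modelled on kpssv2_release_secondary()) only succeeds on a core held in reset, so a cpu_kill that returns the core to reset is what lets the next kernel bring it up. All model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical power states of one secondary core. */
enum model_cpu_state { MODEL_CPU_RESET, MODEL_CPU_RUNNING, MODEL_CPU_WFI };

/* Stand-in for the boot-time release (kpssv2_release_secondary() in the
 * real code): it can only take a core out of reset. */
static bool model_release_secondary(enum model_cpu_state *s)
{
    if (*s != MODEL_CPU_RESET)
        return false;  /* core was never put back in reset: bring-up fails */
    *s = MODEL_CPU_RUNNING;
    return true;
}

/* Current hotplug behaviour described in the mail: the core just does WFI. */
static void model_cpu_die_wfi(enum model_cpu_state *s)
{
    *s = MODEL_CPU_WFI;
}

/* Within the same kernel, a core parked in WFI can be woken again by an
 * interrupt, which is why WFI is enough for plain hotplug. */
static bool model_wake_from_wfi(enum model_cpu_state *s)
{
    if (*s != MODEL_CPU_WFI)
        return false;
    *s = MODEL_CPU_RUNNING;
    return true;
}

/* The missing cpu_kill: undo the release so the core sits in reset again. */
static int model_cpu_kill_to_reset(enum model_cpu_state *s)
{
    *s = MODEL_CPU_RESET;
    return 1;
}
```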

Regards,
 Sricharan
Sricharan Ramabadhran July 4, 2018, 6:36 a.m. UTC | #6
Hi Andy,

> 
> That said, the current code mach-qcom/platsmp.c does not implement the
> cpu_kill callback. Even for a hotplug its just a wfi(). While doing
> a wfi() is going to work for hotplug, it would not for the kexec, since
> the cpu's were never put it reset state and waking them would simply fail.
> 
> That means we need to have the complement to kpssv2_release_secondary
> implemented for cpu_kill callback. I will try to write down the exact
> sequence from programming guide and give here.
> 

Please look at the code in drivers/soc/qcom/spm.c, which controls the CPU
C-state sequence during cpuidle. The SPM block is the one that takes care of
the power-down/power-up sequence of the CPU after 'wfi'. Something similar
then needs to be done in cpu_kill if we expect a CPU to be power-collapsed
and brought back up when the kexec kernel boots.

Also, please have a look at
https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/arch/arm/mach-msm/platsmp.c?h=LA.HB.1.1.5.c1,
which is older non-DT code that does implement cpu_kill. Without PM, it simply does a WFI.

Regards,
 Sricharan
Andy Strohman July 10, 2018, 11:40 p.m. UTC | #7
On Tue, Jul 3, 2018 at 11:36 PM, Sricharan R <sricharan@codeaurora.org> wrote:
> Hi Andy,
>
> Please look at the code in drivers/soc/qcom/spm.c that controls the sequence
> of cpu 'c' state during the cpuidle. spm block is the one that takes care
> of powerdown/up sequence of the cpu after 'wfi' . Similar thing needs to be
> then done for cpu_kill if we expect a 'cpu' to be powercollapsed and to
> be brought back during the kexec kernel reboot.
>
> Also, please have a look at
> https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/arch/arm/mach-msm/platsmp.c?h=LA.HB.1.1.5.c1,
> old non-dt code that is having the cpu_kill back. When no PM, it simply is a WFI.
>
> Regards,
>  Sricharan

Hi Sricharan,

  I just want to thank you for your help and suggestions.  I wasn't
able to get it working quickly, so I have to move on to other things.
If/when I get this working, I'll be sure to let you know what I did.

Thanks,

Andy
Sricharan Ramabadhran July 11, 2018, 5:26 a.m. UTC | #8
Hi Andy,

On 7/11/2018 5:10 AM, Andy Strohman wrote:
> Hi Sricharan,
> 
>   I just want to thank you for your help and suggestions.  I wasn't
> able to get it working quickly, so I have to move on to other things.
> If/when I get this working, I'll be sure to let you know what I did.
> 

 Sure. I will also try this (kexec without nr_cpus=1) and let you know.
 I'm a little busy with a few other things at the moment, but I will put
 this on my to-do list and come back to it.

Regards,
 Sricharan

Patch

--- a/arch/arm/mach-qcom/platsmp.c
+++ b/arch/arm/mach-qcom/platsmp.c
@@ -332,6 +332,12 @@  static void __init qcom_smp_prepare_cpus
  }
 }

+/* Needed by kexec and platform_can_cpu_hotplug() */
+int qcom_cpu_kill(unsigned int cpu)
+{
+    return 1;
+}
+
 static const struct smp_operations smp_msm8660_ops __initconst = {
  .smp_prepare_cpus = qcom_smp_prepare_cpus,
  .smp_secondary_init = qcom_secondary_init,
@@ -358,6 +364,7 @@  static const struct smp_operations qcom_
  .smp_boot_secondary = kpssv2_boot_secondary,
 #ifdef CONFIG_HOTPLUG_CPU
  .cpu_die = qcom_cpu_die,
+    .cpu_kill       = qcom_cpu_kill,
 #endif
 };
 CPU_METHOD_OF_DECLARE(qcom_smp_kpssv2, "qcom,kpss-acc-v2",