ARM: exynos_defconfig: disable CONFIG_EXYNOS5420_MCPM; not stable

Message ID CAM4voa==Ram0xXTL=SnKktyRt7VFphQxBJgcf9Yi9Zio0QctHw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Abhilash Kesavan Nov. 25, 2014, 6:01 a.m. UTC
Hello Kevin,

On Tue, Nov 25, 2014 at 8:50 AM, Kevin Hilman <khilman@kernel.org> wrote:
> On Mon, Nov 24, 2014 at 5:50 PM, Kukjin Kim <kgene@kernel.org> wrote:
>> Olof Johansson wrote:
>>>
>>> On Mon, Nov 24, 2014 at 5:37 PM, Olof Johansson <olof@lixom.net> wrote:
>>> > On Mon, Nov 24, 2014 at 5:35 PM, Kevin Hilman <khilman@kernel.org> wrote:
>>> >> On Mon, Nov 24, 2014 at 4:25 PM, Olof Johansson <olof@lixom.net> wrote:
>>> >>> On Mon, Nov 24, 2014 at 11:51 AM, Kevin Hilman <khilman@kernel.org> wrote:
>>> >>>> Kukjin,
>>> >>>>
>>> >>>> On Mon, Nov 10, 2014 at 11:35 AM, Kevin Hilman <khilman@kernel.org> wrote:
>>> >>>>> Kukjin Kim <kgene@kernel.org> writes:
>>> >>>>>
>>> >>>>>> Kevin Hilman wrote:
>>> >>>>>>>
>>> >>>>>>> From: Kevin Hilman <khilman@linaro.org>
>>> >>>>>>>
>>> >>>>>>> The option CONFIG_EXYNOS5420_MCPM is causing imprecise external aborts
>>> >>>>>>> during boot testing, causing various userspace startup failures.
>>> >>>>>>>
>>> >>>>>>> Disable until it has gotten more testing.
>>> >>>>>>>
>>> >>>>>>> Cc: Kukjin Kim <kgene.kim@samsung.com>,
>>> >>>>>>> Cc: Javier Martinez Canillas <javier.martinez@collabora.co.uk>,
>>> >>>>>>> Cc: Sachin Kamat <sachin.kamat@samsung.com>,
>>> >>>>>>> Cc: Doug Anderson <dianders@chromium.org>,
>>> >>>>>>> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>,
>>> >>>>>>> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>,
>>> >>>>>>> Cc: Tushar Behera <tushar.behera@linaro.org>,
>>> >>>>>>> Cc: stable@vger.kernel.org # v3.17+
>>> >>>>>>> Signed-off-by: Kevin Hilman <khilman@linaro.org>
>>> >>>>>>> ---
>>> >>>>>>> This has been reported by a few people[1], but not investigated or fixed, so it's
>>> >>>>>>> time to disable this feature until it can be fixed.
>>> >>>>>>>
>>> >>>>>> Hi Kevin,
>>> >>>>>>
>>> >>>>>> Yeah I agree with your opinion.
>>> >>>>>>
>>> >>>>>> But as you can see my tree, I've queued regarding mcpm patches for 3.19 will
>>> >>>>>> be shown in -next in this weekend.
>>> >>>>>
>>> >>>>> Which of the recently queued patches are expected to address the
>>> >>>>> imprecise abort issue?  I'd be happy to test them out.
>>> >>>>
>>> >>>> Exynos5 MCPM is still broken in linux-next and still causing an imprecise abort.
>>> >>>>
>>> >>>> What is the status of $SUBJECT patch?
>>> >>>>
>>> >>>>>> Anyway let me apply this into -fixes and
>>> >>>>>> then let's enable after test its functionality in -next in a couple of days.
>>> >>>>>
>>> >>>>> Yes, I think this needs to be applied until these aborts are understood
>>> >>>>> and fixed.
>>> >>>>
>>> >>>> Is anyone at Samsung actually looking into these MCPM issues?
>>> >>>
>>> >>> Hi Kevin,
>>> >>>
>>> >>> What hardware are you having problems with? 5420 or 5422/5800?
>>> >>
>>> >> Yes.  :)
>>> >>
>>> >> exynos5420-arndale-octa:
>>> >> http://storage.armcloud.us/kernel-ci/mainline/v3.18-rc6/arm-exynos_defconfig/boot-exynos5420-arndale-octa.html
>>> >> exynos5422-odroid-xu3:
>>> >> http://storage.armcloud.us/kernel-ci/mainline/v3.18-rc6/arm-exynos_defconfig/boot-exynos5422-odroid-xu3.html
>>> >>
>>> >> My boot tests seem to pass fine because I have such a minimal
>>> >> userspace, but Tyler Baker reported that with a "real" userspace, he
>>> >> can't boot to a shell:
>>> >>
>>> >>   http://lists.infradead.org/pipermail/linux-arm-kernel/2014-September/286203.html
>>> >
>> Hmm...his report was in Sep...I think it should be fine with current -next?
>
> No, it is still broken in linux-next (as I stated above.)
>
> Moreover, earlier in this thread you mentioned you were merging some
> MCPM patches that should address this, but did not respond when I
> asked which patches you think should address this issue.
>
>> To be honest, since I don't have the exynos5420 arndale, chromebook...but smdk
>> which has different bootloader, I couldn't test it...I'll try to make a test
>> farm like you guys...
>
> Do you have some colleagues with any other 542x hardware?  I had
> assumed that linux-next was being better tested on the publicly
> available, and widely available boards like odroid-xu3 and
> Chromebook2, but I've come to realize the hard way that that is not

Are you seeing this on Chromebook2 (Peach-Pi 5800) too ?

> the case.  You mention your board has a different bootloader.  Do you
> suspect there's a bootloader issue on these other platforms?  If so,
> could you elaborate on possible fixes?  I'm more than willing to test
> any proposed fixes, but I'm not familiar enough yet with these SoCs to
> figure out the underlying issues alone.
>
> Until you have a working board farm, you could start having a closer
> look at the boot logs we're already producing.  Admittedly linux-next is
> broken in many ways besides this one for exynos currently, but it has
> been having these imprecise aborts well before the other recent
> issues.
>
> Also, it's very possible that this issue is not even MCPM related at
> all, and MCPM is just uncovering a previously hidden bug.  It would be
> very helpful if people more familiar with this hardware and SoC would
> investigate bug reports like these.

The 3 boards I have access to (SMDK5420, Chromebook Peach-Pi and
Chromebook Peach-Pit) work fine with MCPM enabled. I am not sure why
it is failing only on the above-mentioned boards, as there is nothing
specific to them in the MCPM back-end.

I assume that when you default to platsmp (on disabling MCPM), the
non-working boards boot all cores up to userspace without any issues?

Based on the timeline (problems started about 2.5 months back), there
have only been a couple of changes in the 5420 MCPM back-end. Could
you revert the following commits and check if things improve?

20fe6f9 ARM: EXYNOS: Support cluster power off on exynos5420/5800
fbb0499 ARM: 8083/1: exynos: activate the CCI on boot CPU/cluster
using the MCPM loopback

These might not revert cleanly, so instead of the above you could also
comment out the following 2 lines:





If you still get aborts then I suspect that the problem is with the
bootloader configuration but am not sure. I am OK with disabling
5420_MCPM in the default configuration in such a case. This would
however mean that S2R also stops working by default on 5420.

Regards,
Abhilash
>
> Kevin
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Comments

Kevin Hilman Nov. 26, 2014, 1 a.m. UTC | #1
Hi Abhilash,

Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:

[...]

>>> To be honest, since I don't have the exynos5420 arndale, chromebook...but smdk
>>> which has different bootloader, I couldn't test it...I'll try to make a test
>>> farm like you guys...
>>
>> Do you have some colleagues with any other 542x hardware?  I had
>> assumed that linux-next was being better tested on the publicaly
>> available, and widely available boards like odroid-xu3 and
>> Chromebook2, but I've come to realize the hard way that that is not
>
> Are you seeing this on Chromebook2 (Peach-Pi 5800) too ?

No, it seems that my exynos5800-peach-pi is not having this problem,
which suggests it's a bootloader setup issue.

>> the case.  You mention your board has a different bootloader.  Do you
>> suspect there's a bootloader issue on these other platforms?  If so,
>> could you elaborate on possible fixes?  I'm more than willing to test
>> any proposed fixes, but I'm not familiar enough yet with these SoCs to
>> figure out the underlying issues alone.
>>
>> Until you have a working board farm, you could start having a closer
>> look at the boot logs we're already producing.  Admittedly linux-next
>> broken in many ways besides this one for exynos currently, but it has
>> been having these imprecise aborts well before the other recent
>> issues.
>>
>> Also, It's very possible that this issue is not even MCPM related at
>> all, and MCPM is just uncovering a previously hidden bug.  It would be
>> very helpful if people more familiar with this hardware and SoC would
>> investigate bug reports like these.
>
> The 3 boards I have access to (SMDK5420, Chromebook Peach-Pi and
> Chromebook Peach-Pit) work fine with MCPM enabled. 

Thanks for helping look into this.

> I am not sure why
> it is failing only on the above mentioned boards as there is nothing
> specific to them in the MCPM back-end.
>
> I assume that when you default to platsmp (on disabling MCPM), the
> non-working boards boot all cores upto userspace without any issues ?

Nope.  With MCPM disabled:

  - 5420/arndale-octa: CPU0-3 come up (A15s)
  - 5422/odroid-xu3: only CPU0 (A7)
  - 5800/peach-pi: only CPU0 (A15)

Note that with MCPM enabled, the arndale-octa gets the same result.
Peach-pi on the other hand gets all 8 CPUs, and the odroid-xu3 only gets
6/8 CPUs (see other thread on that topic.)

> Based on the timeline (problems started about 2.5 months back), there
> have only been a couple of changes in the 5420 MCPM back-end. Could
> you revert the following commits and check if things improve.
>
> 20fe6f9 ARM: EXYNOS: Support cluster power off on exynos5420/5800
> fbb0499 ARM: 8083/1: exynos: activate the CCI on boot CPU/cluster
> using the MCPM loopback
>
> These might not revert cleanly, so instead of the above you could also
> comment the following 2 lines:
>
>
> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
> b/arch/arm/mach-exynos/mcpm-exynos.c
> index dc9a764..9a07188 100644
> --- a/arch/arm/mach-exynos/mcpm-exynos.c
> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
> @@ -152,7 +152,7 @@ static void exynos_power_down(void)
>                 exynos_cpu_power_down(cpunr);
>
>                 if (exynos_cluster_unused(cluster)) {
> -                       exynos_cluster_power_down(cluster);
> +                       //exynos_cluster_power_down(cluster);
>                         last_man = true;
>                 }
>         } else if (cpu_use_count[cpu][cluster] == 1) {
> @@ -356,8 +356,8 @@ static int __init exynos_mcpm_init(void)
>         ret = mcpm_platform_register(&exynos_power_ops);
>         if (!ret)
>                 ret = mcpm_sync_init(exynos_pm_power_up_setup);
> -       if (!ret)
> -               ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
> +       //if (!ret)
> +               //ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>         if (ret) {
>                 iounmap(ns_sram_base_addr);
>                 return ret;
>
>
>
> If you still get aborts then I suspect that the problem is with the
> bootloader configuration but am not sure. 

Nice.  With those lines commented out, the arndale-octa is not getting
imprecise aborts anymore, and this is the platform where those aborts
seem to prevent booting into a full userspace (as originally reported by
Tyler.)

More specifically, with only the loopback call that turns on the CCI
commented out, the imprecise aborts go away.

The odroid-xu3 is still getting them, but these seem to happen whether
or not MCPM is enabled, so it must be a different issue related to the
bootloader setup.

> I am OK with disabling
> 5420_MCPM in the default configuration in such a case. This would
> however mean that S2R also stops working by default on 5420.

Disabling the option isn't my first choice either, I would rather see
this issue debugged and fixed by folks that are more familiar with MCPM
on Exynos.

Kevin
Abhilash Kesavan Nov. 26, 2014, 4:58 p.m. UTC | #2
Hi Kevin,

On Wed, Nov 26, 2014 at 6:30 AM, Kevin Hilman <khilman@kernel.org> wrote:
> Hi Abhilash,
>
> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
>
> [...]
>
>>>> To be honest, since I don't have the exynos5420 arndale, chromebook...but smdk
>>>> which has different bootloader, I couldn't test it...I'll try to make a test
>>>> farm like you guys...
>>>
>>> Do you have some colleagues with any other 542x hardware?  I had
>>> assumed that linux-next was being better tested on the publicaly
>>> available, and widely available boards like odroid-xu3 and
>>> Chromebook2, but I've come to realize the hard way that that is not
>>
>> Are you seeing this on Chromebook2 (Peach-Pi 5800) too ?
>
> No, it seems that my exynos5800-peach-pi is not having this problem,
> which suggests it's a bootloader setup issue.
>
>>> the case.  You mention your board has a different bootloader.  Do you
>>> suspect there's a bootloader issue on these other platforms?  If so,
>>> could you elaborate on possible fixes?  I'm more than willing to test
>>> any proposed fixes, but I'm not familiar enough yet with these SoCs to
>>> figure out the underlying issues alone.
>>>
>>> Until you have a working board farm, you could start having a closer
>>> look at the boot logs we're already producing.  Admittedly linux-next
>>> broken in many ways besides this one for exynos currently, but it has
>>> been having these imprecise aborts well before the other recent
>>> issues.
>>>
>>> Also, It's very possible that this issue is not even MCPM related at
>>> all, and MCPM is just uncovering a previously hidden bug.  It would be
>>> very helpful if people more familiar with this hardware and SoC would
>>> investigate bug reports like these.
>>
>> The 3 boards I have access to (SMDK5420, Chromebook Peach-Pi and
>> Chromebook Peach-Pit) work fine with MCPM enabled.
>
> Thanks for helping look into this.
>
>> I am not sure why
>> it is failing only on the above mentioned boards as there is nothing
>> specific to them in the MCPM back-end.
>>
>> I assume that when you default to platsmp (on disabling MCPM), the
>> non-working boards boot all cores upto userspace without any issues ?
>
> Nope.  With MCPM disabled:
>
>   - 5420/arndale-octa: CPU0-3 come up (A15s)
>   - 5422/odroid-xu3: only CPU0 (A7)
>   - 5800/peach-pi: only CPU0 (A15)
>
> Note that with MCPM enabled, the arndale-octa gets the same result.
> Peach-pi on the other hand gets all 8 CPUs, and the odroid-xu3 only gets
> 6/8 CPUs (see other thread on that topic.)
>
>> Based on the timeline (problems started about 2.5 months back), there
>> have only been a couple of changes in the 5420 MCPM back-end. Could
>> you revert the following commits and check if things improve.
>>
>> 20fe6f9 ARM: EXYNOS: Support cluster power off on exynos5420/5800
>> fbb0499 ARM: 8083/1: exynos: activate the CCI on boot CPU/cluster
>> using the MCPM loopback
>>
>> These might not revert cleanly, so instead of the above you could also
>> comment the following 2 lines:
>>
>>
>> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
>> b/arch/arm/mach-exynos/mcpm-exynos.c
>> index dc9a764..9a07188 100644
>> --- a/arch/arm/mach-exynos/mcpm-exynos.c
>> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
>> @@ -152,7 +152,7 @@ static void exynos_power_down(void)
>>                 exynos_cpu_power_down(cpunr);
>>
>>                 if (exynos_cluster_unused(cluster)) {
>> -                       exynos_cluster_power_down(cluster);
>> +                       //exynos_cluster_power_down(cluster);
>>                         last_man = true;
>>                 }
>>         } else if (cpu_use_count[cpu][cluster] == 1) {
>> @@ -356,8 +356,8 @@ static int __init exynos_mcpm_init(void)
>>         ret = mcpm_platform_register(&exynos_power_ops);
>>         if (!ret)
>>                 ret = mcpm_sync_init(exynos_pm_power_up_setup);
>> -       if (!ret)
>> -               ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>> +       //if (!ret)
>> +               //ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>>         if (ret) {
>>                 iounmap(ns_sram_base_addr);
>>                 return ret;
>>
>>
>>
>> If you still get aborts then I suspect that the problem is with the
>> bootloader configuration but am not sure.
>
> Nice.  With those lines commented out, the arndale-octa is not geting
> imprecise aborts anymore, and this is the platform where those aborts
> seem to prevent booting into a full userspace (as originally reported by
> Tyler.)
>
> More specifically, with only the loopback call to turn off CCI commented
> out, the imprecise aborts go away.

I can't see how enabling snoops for the boot cluster is causing these
aborts. Perhaps, as Krzysztof commented, it has something to do with the
secure firmware/tz software on these boards? Other than that, there does
not appear to be any difference between the working/non-working
setups.

Abhilash
>
> The odroid-xu3 is still getting them, but these seem to happen whether
> or not MCPM is enabled, so must a different issue related to the
> bootloader setup.
>
>> I am OK with disabling
>> 5420_MCPM in the default configuration in such a case. This would
>> however mean that S2R also stops working by default on 5420.
>
> Disabling the option isn't my first choice either, I would rather see
> this issue debugged and fixed by folks that are more familiar with MCPM
> on Exynos.
>
> Kevin
Kevin Hilman Nov. 26, 2014, 5:56 p.m. UTC | #3
Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:

> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 6:30 AM, Kevin Hilman <khilman@kernel.org> wrote:
>> Hi Abhilash,
>>
>> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
>>
>> [...]
>>
>>>>> To be honest, since I don't have the exynos5420 arndale, chromebook...but smdk
>>>>> which has different bootloader, I couldn't test it...I'll try to make a test
>>>>> farm like you guys...
>>>>
>>>> Do you have some colleagues with any other 542x hardware?  I had
>>>> assumed that linux-next was being better tested on the publicaly
>>>> available, and widely available boards like odroid-xu3 and
>>>> Chromebook2, but I've come to realize the hard way that that is not
>>>
>>> Are you seeing this on Chromebook2 (Peach-Pi 5800) too ?
>>
>> No, it seems that my exynos5800-peach-pi is not having this problem,
>> which suggests it's a bootloader setup issue.
>>
>>>> the case.  You mention your board has a different bootloader.  Do you
>>>> suspect there's a bootloader issue on these other platforms?  If so,
>>>> could you elaborate on possible fixes?  I'm more than willing to test
>>>> any proposed fixes, but I'm not familiar enough yet with these SoCs to
>>>> figure out the underlying issues alone.
>>>>
>>>> Until you have a working board farm, you could start having a closer
>>>> look at the boot logs we're already producing.  Admittedly linux-next
>>>> broken in many ways besides this one for exynos currently, but it has
>>>> been having these imprecise aborts well before the other recent
>>>> issues.
>>>>
>>>> Also, It's very possible that this issue is not even MCPM related at
>>>> all, and MCPM is just uncovering a previously hidden bug.  It would be
>>>> very helpful if people more familiar with this hardware and SoC would
>>>> investigate bug reports like these.
>>>
>>> The 3 boards I have access to (SMDK5420, Chromebook Peach-Pi and
>>> Chromebook Peach-Pit) work fine with MCPM enabled.
>>
>> Thanks for helping look into this.
>>
>>> I am not sure why
>>> it is failing only on the above mentioned boards as there is nothing
>>> specific to them in the MCPM back-end.
>>>
>>> I assume that when you default to platsmp (on disabling MCPM), the
>>> non-working boards boot all cores upto userspace without any issues ?
>>
>> Nope.  With MCPM disabled:
>>
>>   - 5420/arndale-octa: CPU0-3 come up (A15s)
>>   - 5422/odroid-xu3: only CPU0 (A7)
>>   - 5800/peach-pi: only CPU0 (A15)
>>
>> Note that with MCPM enabled, the arndale-octa gets the same result.
>> Peach-pi on the other hand gets all 8 CPUs, and the odroid-xu3 only gets
>> 6/8 CPUs (see other thread on that topic.)
>>
>>> Based on the timeline (problems started about 2.5 months back), there
>>> have only been a couple of changes in the 5420 MCPM back-end. Could
>>> you revert the following commits and check if things improve.
>>>
>>> 20fe6f9 ARM: EXYNOS: Support cluster power off on exynos5420/5800
>>> fbb0499 ARM: 8083/1: exynos: activate the CCI on boot CPU/cluster
>>> using the MCPM loopback
>>>
>>> These might not revert cleanly, so instead of the above you could also
>>> comment the following 2 lines:
>>>
>>>
>>> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
>>> b/arch/arm/mach-exynos/mcpm-exynos.c
>>> index dc9a764..9a07188 100644
>>> --- a/arch/arm/mach-exynos/mcpm-exynos.c
>>> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
>>> @@ -152,7 +152,7 @@ static void exynos_power_down(void)
>>>                 exynos_cpu_power_down(cpunr);
>>>
>>>                 if (exynos_cluster_unused(cluster)) {
>>> -                       exynos_cluster_power_down(cluster);
>>> +                       //exynos_cluster_power_down(cluster);
>>>                         last_man = true;
>>>                 }
>>>         } else if (cpu_use_count[cpu][cluster] == 1) {
>>> @@ -356,8 +356,8 @@ static int __init exynos_mcpm_init(void)
>>>         ret = mcpm_platform_register(&exynos_power_ops);
>>>         if (!ret)
>>>                 ret = mcpm_sync_init(exynos_pm_power_up_setup);
>>> -       if (!ret)
>>> -               ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>>> +       //if (!ret)
>>> +               //ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>>>         if (ret) {
>>>                 iounmap(ns_sram_base_addr);
>>>                 return ret;
>>>
>>>
>>>
>>> If you still get aborts then I suspect that the problem is with the
>>> bootloader configuration but am not sure.
>>
>> Nice.  With those lines commented out, the arndale-octa is not geting
>> imprecise aborts anymore, and this is the platform where those aborts
>> seem to prevent booting into a full userspace (as originally reported by
>> Tyler.)
>>
>> More specifically, with only the loopback call to turn off CCI commented
>> out, the imprecise aborts go away.
>
> I can't see how enabling snoops for the boot cluster is causing these
> aborts. Perhaps as Krzysztof commented it has something to do with the
> secure firmware/tz software on these boards ? Other than there does
> not appear to be any difference between the working/non-working
> setups.

Perhaps the secure firmware is preventing the CCI from being enabled by
the kernel, and that is causing the imprecise abort?

Is there a way to update/replace the BL1/BL2/TZ firmware blobs with
something that is known to be working better?

Kevin
kgene@kernel.org Nov. 26, 2014, 6:11 p.m. UTC | #4
On 11/27/14 02:56, Kevin Hilman wrote:
> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
> 
>> Hi Kevin,
>>
>> On Wed, Nov 26, 2014 at 6:30 AM, Kevin Hilman <khilman@kernel.org> wrote:
>>> Hi Abhilash,
>>>
>>> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
>>>
>>> [...]
>>>
>>>>>> To be honest, since I don't have the exynos5420 arndale, chromebook...but smdk
>>>>>> which has different bootloader, I couldn't test it...I'll try to make a test
>>>>>> farm like you guys...
>>>>>
>>>>> Do you have some colleagues with any other 542x hardware?  I had
>>>>> assumed that linux-next was being better tested on the publicaly
>>>>> available, and widely available boards like odroid-xu3 and
>>>>> Chromebook2, but I've come to realize the hard way that that is not
>>>>
>>>> Are you seeing this on Chromebook2 (Peach-Pi 5800) too ?
>>>
>>> No, it seems that my exynos5800-peach-pi is not having this problem,
>>> which suggests it's a bootloader setup issue.
>>>
>>>>> the case.  You mention your board has a different bootloader.  Do you
>>>>> suspect there's a bootloader issue on these other platforms?  If so,
>>>>> could you elaborate on possible fixes?  I'm more than willing to test
>>>>> any proposed fixes, but I'm not familiar enough yet with these SoCs to
>>>>> figure out the underlying issues alone.
>>>>>
>>>>> Until you have a working board farm, you could start having a closer
>>>>> look at the boot logs we're already producing.  Admittedly linux-next
>>>>> broken in many ways besides this one for exynos currently, but it has
>>>>> been having these imprecise aborts well before the other recent
>>>>> issues.
>>>>>
>>>>> Also, It's very possible that this issue is not even MCPM related at
>>>>> all, and MCPM is just uncovering a previously hidden bug.  It would be
>>>>> very helpful if people more familiar with this hardware and SoC would
>>>>> investigate bug reports like these.
>>>>
>>>> The 3 boards I have access to (SMDK5420, Chromebook Peach-Pi and
>>>> Chromebook Peach-Pit) work fine with MCPM enabled.
>>>
>>> Thanks for helping look into this.
>>>
>>>> I am not sure why
>>>> it is failing only on the above mentioned boards as there is nothing
>>>> specific to them in the MCPM back-end.
>>>>
>>>> I assume that when you default to platsmp (on disabling MCPM), the
>>>> non-working boards boot all cores upto userspace without any issues ?
>>>
>>> Nope.  With MCPM disabled:
>>>
>>>   - 5420/arndale-octa: CPU0-3 come up (A15s)
>>>   - 5422/odroid-xu3: only CPU0 (A7)
>>>   - 5800/peach-pi: only CPU0 (A15)
>>>
>>> Note that with MCPM enabled, the arndale-octa gets the same result.
>>> Peach-pi on the other hand gets all 8 CPUs, and the odroid-xu3 only gets
>>> 6/8 CPUs (see other thread on that topic.)
>>>
>>>> Based on the timeline (problems started about 2.5 months back), there
>>>> have only been a couple of changes in the 5420 MCPM back-end. Could
>>>> you revert the following commits and check if things improve.
>>>>
>>>> 20fe6f9 ARM: EXYNOS: Support cluster power off on exynos5420/5800
>>>> fbb0499 ARM: 8083/1: exynos: activate the CCI on boot CPU/cluster
>>>> using the MCPM loopback
>>>>
>>>> These might not revert cleanly, so instead of the above you could also
>>>> comment the following 2 lines:
>>>>
>>>>
>>>> diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
>>>> b/arch/arm/mach-exynos/mcpm-exynos.c
>>>> index dc9a764..9a07188 100644
>>>> --- a/arch/arm/mach-exynos/mcpm-exynos.c
>>>> +++ b/arch/arm/mach-exynos/mcpm-exynos.c
>>>> @@ -152,7 +152,7 @@ static void exynos_power_down(void)
>>>>                 exynos_cpu_power_down(cpunr);
>>>>
>>>>                 if (exynos_cluster_unused(cluster)) {
>>>> -                       exynos_cluster_power_down(cluster);
>>>> +                       //exynos_cluster_power_down(cluster);
>>>>                         last_man = true;
>>>>                 }
>>>>         } else if (cpu_use_count[cpu][cluster] == 1) {
>>>> @@ -356,8 +356,8 @@ static int __init exynos_mcpm_init(void)
>>>>         ret = mcpm_platform_register(&exynos_power_ops);
>>>>         if (!ret)
>>>>                 ret = mcpm_sync_init(exynos_pm_power_up_setup);
>>>> -       if (!ret)
>>>> -               ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>>>> +       //if (!ret)
>>>> +               //ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
>>>>         if (ret) {
>>>>                 iounmap(ns_sram_base_addr);
>>>>                 return ret;
>>>>
>>>>
>>>>
>>>> If you still get aborts then I suspect that the problem is with the
>>>> bootloader configuration but am not sure.
>>>
>>> Nice.  With those lines commented out, the arndale-octa is not geting
>>> imprecise aborts anymore, and this is the platform where those aborts
>>> seem to prevent booting into a full userspace (as originally reported by
>>> Tyler.)
>>>
>>> More specifically, with only the loopback call to turn off CCI commented
>>> out, the imprecise aborts go away.
>>
>> I can't see how enabling snoops for the boot cluster is causing these
>> aborts. Perhaps as Krzysztof commented it has something to do with the
>> secure firmware/tz software on these boards ? Other than there does
>> not appear to be any difference between the working/non-working
>> setups.
> 
> Perhaps the secure firmware is preventing the CCI to be enabled by the
> kernel, and that is causing the imprecise abort?
> 
> Is there a way to update/replace the BL1/BL2/TZ firmware blobs with
> something that is known to be working better?  
> 
The problem you mentioned seems to be due to a different bootloader, as I
commented before, but I think releasing bootloader images (bl1, bl2 and
so on) should be handled by the board manufacturer, not the SoC vendor,
even though the images are provided to the manufacturer by the vendor. To
be honest, I'm not sure what procedure would have to be followed on the
Samsung side for now, because we, including Abhilash, belong only to the
development team. This needs some time and I can't confirm anything yet,
sorry. Let us try.

BTW, Kevin, do you know the current version of the bootloader images on
the boards?

Thanks,
Kukjin
Nicolas Pitre Nov. 26, 2014, 6:41 p.m. UTC | #5
On Wed, 26 Nov 2014, Kevin Hilman wrote:

> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
> 
> > Hi Kevin,
> >
> > On Wed, Nov 26, 2014 at 6:30 AM, Kevin Hilman <khilman@kernel.org> wrote:
> >> [...]
> >>
> >> More specifically, with only the loopback call to turn off CCI commented
> >> out, the imprecise aborts go away.
> >
> > I can't see how enabling snoops for the boot cluster is causing these
> > aborts. Perhaps as Krzysztof commented it has something to do with the
> > secure firmware/tz software on these boards ? Other than there does
> > not appear to be any difference between the working/non-working
> > setups.
> 
> Perhaps the secure firmware is preventing the CCI to be enabled by the
> kernel, and that is causing the imprecise abort?

That is well possible.

Now... if the bootloader/firmware does not let Linux deal with both
the CCI and caches then MCPM simply has no more purpose for this board.
The whole point of MCPM is actually to handle the CCI properly and in the
most efficient way despite all the possible races and opportunities for
memory corruption. And yes, this is a complex task.

So there are actually two choices: the firmware lets Linux take care of it
via the MCPM layer (easy), or the firmware has to implement it all
_properly_ (hard) behind an interface such as PSCI, at which point MCPM
should be configured out.

If the firmware does not let Linux interact with the CCI _and_ does not 
implement full MCPM-like services then the platform is broken and only a 
firmware upgrade could fix that.  It might still be possible to boot all 
CPUs through other means, but power management would then be severely 
limited.


Nicolas
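
The three cases described in the message above can be sketched as a small decision function. This is an illustrative sketch only: the enum, the function and both parameter names are hypothetical and do not exist in the kernel, and the tie-break (preferring MCPM whenever Linux is allowed to manage the CCI) is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in the kernel. */
enum smp_backend {
	BACKEND_MCPM,	/* firmware lets Linux manage the CCI and caches */
	BACKEND_PSCI,	/* firmware implements it all; MCPM configured out */
	BACKEND_BROKEN	/* neither: only a firmware upgrade could fix it */
};

/*
 * linux_controls_cci: firmware lets the kernel program the CCI.
 * firmware_full_pm:   firmware offers complete MCPM-like services,
 *                     for example behind a PSCI interface.
 */
enum smp_backend choose_backend(bool linux_controls_cci, bool firmware_full_pm)
{
	if (linux_controls_cci)
		return BACKEND_MCPM;	/* assumed preference, see lead-in */
	if (firmware_full_pm)
		return BACKEND_PSCI;
	return BACKEND_BROKEN;
}

/* Usage: walk through the three cases from the message. */
void demo_backend_choice(void)
{
	assert(choose_backend(true, false) == BACKEND_MCPM);
	assert(choose_backend(false, true) == BACKEND_PSCI);
	assert(choose_backend(false, false) == BACKEND_BROKEN);
}
```

In these terms, the suspicion raised for the failing boards is the last case: the kernel may not be allowed to touch the CCI, and no firmware replacement for MCPM is known to exist.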
Sudeep Holla Nov. 27, 2014, 6:57 p.m. UTC | #6
On 26/11/14 18:41, Nicolas Pitre wrote:
> On Wed, 26 Nov 2014, Kevin Hilman wrote:
>
>> Abhilash Kesavan <kesavan.abhilash@gmail.com> writes:
>>
>>> Hi Kevin,
>>>
>>> On Wed, Nov 26, 2014 at 6:30 AM, Kevin Hilman <khilman@kernel.org> wrote:
>>>> [...]
>>>>
>>>> More specifically, with only the loopback call to turn off CCI commented
>>>> out, the imprecise aborts go away.
>>>
>>> I can't see how enabling snoops for the boot cluster is causing these
>>> aborts. Perhaps as Krzysztof commented it has something to do with the
>>> secure firmware/tz software on these boards ? Other than there does
>>> not appear to be any difference between the working/non-working
>>> setups.
>>
>> Perhaps the secure firmware is preventing the CCI to be enabled by the
>> kernel, and that is causing the imprecise abort?
>
> That is well possible.
>
> Now...... if the bootloader/firmware does not let Linux deal with both
> the CCI and caches then MCPM simply has no more purpose for this board.
> The whole point of MCPM is actually to handle the CCI properly and the
> most efficient way despite all the possible races and opportunities for
> memory corruptions. And yes, this is a complex task.
>
> So there is actually two choices: the firmware let Linux take care of it
> via the MCPM layer (easy), or the firmware has to implement it all
> _properly_ (hard) behind an interface such as PSCI, at which point MCPM
> should be configured out.
>
> If the firmware does not let Linux interact with the CCI _and_ does not
> implement full MCPM-like services then the platform is broken and only a
> firmware upgrade could fix that.  It might still be possible to boot all
> CPUs through other means, but power management would then be severely
> limited.
>

Thanks Nico for the detailed description of the requirements for using
MCPM. This is the kind of issue I was worried about in the other thread on
the Fujitsu platform. That's the reason I was asking for information about
their secure firmware and what exactly it configures, so that we won't
end up in a similar situation there too, and definitely not to push
PSCI. I completely agree with you that making a small change in firmware
to give control of the CCI to the kernel is easy.

If the vendors decline to apply this small fix to the firmware, we
should probably leave them with the *only choice* of a PSCI implementation,
which is quite complex and easy to get wrong. That might prompt them to
provide a small fix to use MCPM.

Regards,
Sudeep

Patch

diff --git a/arch/arm/mach-exynos/mcpm-exynos.c
b/arch/arm/mach-exynos/mcpm-exynos.c
index dc9a764..9a07188 100644
--- a/arch/arm/mach-exynos/mcpm-exynos.c
+++ b/arch/arm/mach-exynos/mcpm-exynos.c
@@ -152,7 +152,7 @@  static void exynos_power_down(void)
                exynos_cpu_power_down(cpunr);

                if (exynos_cluster_unused(cluster)) {
-                       exynos_cluster_power_down(cluster);
+                       //exynos_cluster_power_down(cluster);
                        last_man = true;
                }
        } else if (cpu_use_count[cpu][cluster] == 1) {
@@ -356,8 +356,8 @@  static int __init exynos_mcpm_init(void)
        ret = mcpm_platform_register(&exynos_power_ops);
        if (!ret)
                ret = mcpm_sync_init(exynos_pm_power_up_setup);
-       if (!ret)
-               ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
+       //if (!ret)
+               //ret = mcpm_loopback(exynos_cache_off); /* turn on the CCI */
        if (ret) {
                iounmap(ns_sram_base_addr);
                return ret;
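
The experiment in this patch comments the calls out by hand. The same bisection could be expressed as a switch around the loopback step, which is what the standalone sketch below models. It is not kernel code: the three `stub_*` functions are hypothetical stand-ins for `mcpm_platform_register()`, `mcpm_sync_init()` and `mcpm_loopback()`, and the `skip_cci_loopback` parameter is an assumption introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the (stubbed) CCI loopback step was executed. */
static int loopback_calls;

/* Hypothetical stubs standing in for the real kernel calls. */
int stub_platform_register(void) { return 0; }
int stub_sync_init(void) { return 0; }
int stub_loopback(void) { loopback_calls++; return 0; }

/*
 * Mirrors the order of calls in exynos_mcpm_init(), but makes the
 * loopback step (which turns on the CCI) optional instead of
 * commenting it out.
 */
int mcpm_init_sketch(bool skip_cci_loopback)
{
	int ret;

	ret = stub_platform_register();
	if (!ret)
		ret = stub_sync_init();
	if (!ret && !skip_cci_loopback)
		ret = stub_loopback();	/* turn on the CCI */
	return ret;
}

/* Usage: the loopback runs only when it is not skipped. */
void demo_mcpm_init(void)
{
	loopback_calls = 0;
	assert(mcpm_init_sketch(true) == 0);
	assert(loopback_calls == 0);
	assert(mcpm_init_sketch(false) == 0);
	assert(loopback_calls == 1);
}
```

A switch like this only helps narrow down which step triggers the aborts; as discussed in the thread, running MCPM without the CCI being enabled is not a configuration anyone proposed shipping.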