Message ID | 1365771875-5788-1-git-send-email-andre.przywara@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 12/04/13 14:04, Andre Przywara wrote: > kvm_target_cpus() checks the compatibility of the used CPU with > KVM, which is currently limited to ARM Cortex-A15 cores. > However by calling it only once on any random CPU it assumes that > all cores are the same, which is not true for big.LITTLE parts. > > After doing about 40 boots on a TC-2 core tile, I found it running > in all but one case on one of the A7 cores (which correctly denied > KVM initialization). On the 39th boot however the code ran on > an A15, leading to a hang after returning success: > > ... > TCP: reno registered > UDP hash table entries: 512 (order: 2, 16384 bytes) > UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) > NET: Registered protocol family 1 > kvm_target_cpu() on CPU #1, part is c0f0 > ... (pause for a while) ... > INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPU > s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, c=4294966998, q=15) > Task dump for CPU 1: > swapper/0 R running 0 1 0 0x00000002 > > So iterate over every CPU to correctly determine the capability of > the system to run the current KVM implementation. > In case a big.LITTLE configuration is the reason for denial, give > the user a hint how to get it running anyway (maxcpus= on the kernel > command line). > > Please push this still into 3.9. > > Signed-off-by: Andre Przywara <andre.przywara@linaro.org> [...] Nak. The fact that one of the CPUs seem to hang is a sure sign that something is severely broken, and you definitely want to fix that issue, instead of blindly ignoring it. Additionally, it seems you're just papering over the issue. You should be able to exclude the A7 processors, but not completely deny KVM from running on the hardware. M.
On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: > Nak. The fact that one of the CPUs seem to hang is a sure sign that > something is severely broken, and you definitely want to fix that issue, > instead of blindly ignoring it. > > Additionally, it seems you're just papering over the issue. You should > be able to exclude the A7 processors, but not completely deny KVM from > running on the hardware. Well that might be nice, as would fully supporting big.LITTLE systems. But until somebody actually does that work it seems like a better idea to fail gracefully rather than having a 50% chance of failing gracefully and a 50% chance of going weird. -- PMM
On 12/04/13 14:40, Peter Maydell wrote: > On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >> Nak. The fact that one of the CPUs seem to hang is a sure sign that >> something is severely broken, and you definitely want to fix that issue, >> instead of blindly ignoring it. >> >> Additionally, it seems you're just papering over the issue. You should >> be able to exclude the A7 processors, but not completely deny KVM from >> running on the hardware. > > Well that might be nice, as would fully supporting big.LITTLE > systems. But until somebody actually does that work it seems > like a better idea to fail gracefully rather than having a 50% > chance of failing gracefully and a 50% chance of going weird. Nothing prevents the kernel (or even the user) from forcing the affinity of the CPU threads to the A15s. I'm not saying we should ignore the problem either. Just that the proposed approach is wrong. M.
On 04/12/2013 03:24 PM, Marc Zyngier wrote: > On 12/04/13 14:04, Andre Przywara wrote: >> kvm_target_cpus() checks the compatibility of the used CPU with >> KVM, which is currently limited to ARM Cortex-A15 cores. >> However by calling it only once on any random CPU it assumes that >> all cores are the same, which is not true for big.LITTLE parts. >> >> After doing about 40 boots on a TC-2 core tile, I found it running >> in all but one case on one of the A7 cores (which correctly denied >> KVM initialization). On the 39th boot however the code ran on >> an A15, leading to a hang after returning success: >> >> ... >> TCP: reno registered >> UDP hash table entries: 512 (order: 2, 16384 bytes) >> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >> NET: Registered protocol family 1 >> kvm_target_cpu() on CPU #1, part is c0f0 >> ... (pause for a while) ... >> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPU >> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, c=4294966998, q=15) >> Task dump for CPU 1: >> swapper/0 R running 0 1 0 0x00000002 >> >> So iterate over every CPU to correctly determine the capability of >> the system to run the current KVM implementation. >> In case a big.LITTLE configuration is the reason for denial, give >> the user a hint how to get it running anyway (maxcpus= on the kernel >> command line). >> >> Please push this still into 3.9. >> >> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> > > [...] > > Nak. The fact that one of the CPUs seem to hang is a sure sign that > something is severely broken, and you definitely want to fix that issue, > instead of blindly ignoring it. In general I agree, but we should fix it quickly for 3.9, as it can cause a kernel hang. > Additionally, it seems you're just papering over the issue. You should > be able to exclude the A7 processors, but not completely deny KVM from > running on the hardware. Is there an easy way to prevent VCPUs from running on not-supported cores? Something like an kvm_cpu_mask? What would be the wisest hook to utilize for this? It looks like much of the CPU selection code is generic, so this would be non-trivial, right? Regards, Andre.
On 12/04/13 14:58, Andre Przywara wrote: > On 04/12/2013 03:24 PM, Marc Zyngier wrote: >> On 12/04/13 14:04, Andre Przywara wrote: >>> kvm_target_cpus() checks the compatibility of the used CPU with >>> KVM, which is currently limited to ARM Cortex-A15 cores. >>> However by calling it only once on any random CPU it assumes that >>> all cores are the same, which is not true for big.LITTLE parts. >>> >>> After doing about 40 boots on a TC-2 core tile, I found it running >>> in all but one case on one of the A7 cores (which correctly denied >>> KVM initialization). On the 39th boot however the code ran on >>> an A15, leading to a hang after returning success: >>> >>> ... >>> TCP: reno registered >>> UDP hash table entries: 512 (order: 2, 16384 bytes) >>> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >>> NET: Registered protocol family 1 >>> kvm_target_cpu() on CPU #1, part is c0f0 >>> ... (pause for a while) ... >>> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPU >>> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, c=4294966998, q=15) >>> Task dump for CPU 1: >>> swapper/0 R running 0 1 0 0x00000002 >>> >>> So iterate over every CPU to correctly determine the capability of >>> the system to run the current KVM implementation. >>> In case a big.LITTLE configuration is the reason for denial, give >>> the user a hint how to get it running anyway (maxcpus= on the kernel >>> command line). >>> >>> Please push this still into 3.9. >>> >>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> >> >> [...] >> >> Nak. The fact that one of the CPUs seem to hang is a sure sign that >> something is severely broken, and you definitely want to fix that issue, >> instead of blindly ignoring it. > > In general I agree, but we should fix it quickly for 3.9, as it can > cause a kernel hang. Fixing it, certainly. As you have a BL system up and running, maybe you could start by finding out what is this CPU doing, and how it ended up there? >> Additionally, it seems you're just papering over the issue. You should >> be able to exclude the A7 processors, but not completely deny KVM from >> running on the hardware. > > Is there an easy way to prevent VCPUs from running on not-supported > cores? Something like an kvm_cpu_mask? Probably. > What would be the wisest hook to utilize for this? The scheduler has sched_[gs]etaffinity(). if you're allowed to call these though... M.
On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 12/04/13 14:04, Andre Przywara wrote: >> kvm_target_cpus() checks the compatibility of the used CPU with >> KVM, which is currently limited to ARM Cortex-A15 cores. >> However by calling it only once on any random CPU it assumes that >> all cores are the same, which is not true for big.LITTLE parts. >> >> After doing about 40 boots on a TC-2 core tile, I found it running >> in all but one case on one of the A7 cores (which correctly denied >> KVM initialization). On the 39th boot however the code ran on >> an A15, leading to a hang after returning success: >> >> ... >> TCP: reno registered >> UDP hash table entries: 512 (order: 2, 16384 bytes) >> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >> NET: Registered protocol family 1 >> kvm_target_cpu() on CPU #1, part is c0f0 >> ... (pause for a while) ... >> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPU >> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, c=4294966998, q=15) >> Task dump for CPU 1: >> swapper/0 R running 0 1 0 0x00000002 >> >> So iterate over every CPU to correctly determine the capability of >> the system to run the current KVM implementation. >> In case a big.LITTLE configuration is the reason for denial, give >> the user a hint how to get it running anyway (maxcpus= on the kernel >> command line). >> >> Please push this still into 3.9. >> >> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> > > [...] > > Nak. The fact that one of the CPUs seem to hang is a sure sign that > something is severely broken, and you definitely want to fix that issue, > instead of blindly ignoring it. > > Additionally, it seems you're just papering over the issue. You should > be able to exclude the A7 processors, but not completely deny KVM from > running on the hardware. > Marc, I disagree with this nak. If the current kernel breaks boot on a Big.Little system, we need to take care of that first, and the proposed patch is a quick way to do so, and it does not stand in the way of introducing proper Big.Little support in any way, which I'm sure is going to open up a lot of other interesting questions. I'm going to take this one. -Christoffer
On Sun, 14 Apr 2013 21:57:36 -0700, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On 12/04/13 14:04, Andre Przywara wrote: >>> kvm_target_cpus() checks the compatibility of the used CPU with >>> KVM, which is currently limited to ARM Cortex-A15 cores. >>> However by calling it only once on any random CPU it assumes that >>> all cores are the same, which is not true for big.LITTLE parts. >>> >>> After doing about 40 boots on a TC-2 core tile, I found it running >>> in all but one case on one of the A7 cores (which correctly denied >>> KVM initialization). On the 39th boot however the code ran on >>> an A15, leading to a hang after returning success: >>> >>> ... >>> TCP: reno registered >>> UDP hash table entries: 512 (order: 2, 16384 bytes) >>> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >>> NET: Registered protocol family 1 >>> kvm_target_cpu() on CPU #1, part is c0f0 >>> ... (pause for a while) ... >>> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected >>> stalls on CPU >>> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, >>> c=4294966998, q=15) >>> Task dump for CPU 1: >>> swapper/0 R running 0 1 0 0x00000002 >>> >>> So iterate over every CPU to correctly determine the capability of >>> the system to run the current KVM implementation. >>> In case a big.LITTLE configuration is the reason for denial, give >>> the user a hint how to get it running anyway (maxcpus= on the kernel >>> command line). >>> >>> Please push this still into 3.9. >>> >>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> >> >> [...] >> >> Nak. The fact that one of the CPUs seem to hang is a sure sign that >> something is severely broken, and you definitely want to fix that issue, >> instead of blindly ignoring it. >> >> Additionally, it seems you're just papering over the issue. You should >> be able to exclude the A7 processors, but not completely deny KVM from >> running on the hardware. >> > Marc, I disagree with this nak. If the current kernel breaks boot on a > Big.Little system, we need to take care of that first, and the > proposed patch is a quick way to do so, and it does not stand in the > way of introducing proper Big.Little support in any way, which I'm > sure is going to open up a lot of other interesting questions. > > I'm going to take this one. That's your privilege. Now, my objections about this patch still stand: - It papers over what looks like a serious bug (CPU hanging? bah, let's not bother with that). It is an A15 hanging here by the way, not an A7. - It forces the user to choose between 5 CPUs and KVM (while simply setting thread affinity would solve the problem this patch tries to solve). - It reports potentially wrong information (setting maxcpus= will probably do the wrong thing if cluster 0 is A7 based). For these reasons, I'm still strongly opposed to this patch being merged. Yes, this is a quick way to hide a (much) bigger problem, but in no way a fix. M.
On Mon, Apr 15, 2013 at 12:50 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On Sun, 14 Apr 2013 21:57:36 -0700, Christoffer Dall > <cdall@cs.columbia.edu> wrote: >> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> > wrote: >>> On 12/04/13 14:04, Andre Przywara wrote: >>>> kvm_target_cpus() checks the compatibility of the used CPU with >>>> KVM, which is currently limited to ARM Cortex-A15 cores. >>>> However by calling it only once on any random CPU it assumes that >>>> all cores are the same, which is not true for big.LITTLE parts. >>>> >>>> After doing about 40 boots on a TC-2 core tile, I found it running >>>> in all but one case on one of the A7 cores (which correctly denied >>>> KVM initialization). On the 39th boot however the code ran on >>>> an A15, leading to a hang after returning success: >>>> >>>> ... >>>> TCP: reno registered >>>> UDP hash table entries: 512 (order: 2, 16384 bytes) >>>> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >>>> NET: Registered protocol family 1 >>>> kvm_target_cpu() on CPU #1, part is c0f0 >>>> ... (pause for a while) ... >>>> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected >>>> stalls on CPU >>>> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, >>>> c=4294966998, q=15) >>>> Task dump for CPU 1: >>>> swapper/0 R running 0 1 0 0x00000002 >>>> >>>> So iterate over every CPU to correctly determine the capability of >>>> the system to run the current KVM implementation. >>>> In case a big.LITTLE configuration is the reason for denial, give >>>> the user a hint how to get it running anyway (maxcpus= on the kernel >>>> command line). >>>> >>>> Please push this still into 3.9. >>>> >>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> >>> >>> [...] >>> >>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>> something is severely broken, and you definitely want to fix that > issue, >>> instead of blindly ignoring it. >>> >>> Additionally, it seems you're just papering over the issue. You should >>> be able to exclude the A7 processors, but not completely deny KVM from >>> running on the hardware. >>> >> Marc, I disagree with this nak. If the current kernel breaks boot on a >> Big.Little system, we need to take care of that first, and the >> proposed patch is a quick way to do so, and it does not stand in the >> way of introducing proper Big.Little support in any way, which I'm >> sure is going to open up a lot of other interesting questions. >> >> I'm going to take this one. > > That's your privilege. > > Now, my objections about this patch still stand: > - It papers over what looks like a serious bug (CPU hanging? bah, let's > not bother with that). It is an A15 hanging here by the way, not an A7. I missed that part. Are we even sure then that this is related to running on Big.Little? > - It forces the user to choose between 5 CPUs and KVM (while simply > setting thread affinity would solve the problem this patch tries to solve). But this happens during boot, so thread affinity won't really be a valid work around for this... > - It reports potentially wrong information (setting maxcpus= will probably > do the wrong thing if cluster 0 is A7 based). ok, fair enough. > > For these reasons, I'm still strongly opposed to this patch being merged. > Yes, this is a quick way to hide a (much) bigger problem, but in no way a > fix. > I'm not claiming this to be a fix for the underlying problem, and I would much rather see a proper fix. But, I also don't want kernels configured with KVM to cause systems not to boot and if we can prevent that from happening for now, in whichever rough possible way, I think that takes priority. Right now all the code is written with the inherent assumption that we're running on an A15, and while I did not scrutinize if some of our code can break the host if run on an A7, we should probably make sure that doesn't happen until we actively support Big.Little and A7. I think relying on users not to crash their kernels by setting CPU affinities is also a big mistake. Andre, when I look back over the log, I'm actually not seeing definite proof that it's inside the KVM init that the core hangs, how do you determine this? Also, if you only experienced the problem in one boot out of 40, how can you be sure that this patch actually fixes the issue you're seeing?
On Mon, 15 Apr 2013 01:28:25 -0700, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 12:50 AM, Marc Zyngier <marc.zyngier@arm.com> > wrote: >> On Sun, 14 Apr 2013 21:57:36 -0700, Christoffer Dall >> <cdall@cs.columbia.edu> wrote: >>> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> >> wrote: >>>> On 12/04/13 14:04, Andre Przywara wrote: >>>>> kvm_target_cpus() checks the compatibility of the used CPU with >>>>> KVM, which is currently limited to ARM Cortex-A15 cores. >>>>> However by calling it only once on any random CPU it assumes that >>>>> all cores are the same, which is not true for big.LITTLE parts. >>>>> >>>>> After doing about 40 boots on a TC-2 core tile, I found it running >>>>> in all but one case on one of the A7 cores (which correctly denied >>>>> KVM initialization). On the 39th boot however the code ran on >>>>> an A15, leading to a hang after returning success: >>>>> >>>>> ... >>>>> TCP: reno registered >>>>> UDP hash table entries: 512 (order: 2, 16384 bytes) >>>>> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >>>>> NET: Registered protocol family 1 >>>>> kvm_target_cpu() on CPU #1, part is c0f0 >>>>> ... (pause for a while) ... >>>>> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected >>>>> stalls on CPU >>>>> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, >>>>> c=4294966998, q=15) >>>>> Task dump for CPU 1: >>>>> swapper/0 R running 0 1 0 0x00000002 >>>>> >>>>> So iterate over every CPU to correctly determine the capability of >>>>> the system to run the current KVM implementation. >>>>> In case a big.LITTLE configuration is the reason for denial, give >>>>> the user a hint how to get it running anyway (maxcpus= on the kernel >>>>> command line). >>>>> >>>>> Please push this still into 3.9. >>>>> >>>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> >>>> >>>> [...] >>>> >>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>> something is severely broken, and you definitely want to fix that >> issue, >>>> instead of blindly ignoring it. >>>> >>>> Additionally, it seems you're just papering over the issue. You should >>>> be able to exclude the A7 processors, but not completely deny KVM from >>>> running on the hardware. >>>> >>> Marc, I disagree with this nak. If the current kernel breaks boot on a >>> Big.Little system, we need to take care of that first, and the >>> proposed patch is a quick way to do so, and it does not stand in the >>> way of introducing proper Big.Little support in any way, which I'm >>> sure is going to open up a lot of other interesting questions. >>> >>> I'm going to take this one. >> >> That's your privilege. >> >> Now, my objections about this patch still stand: >> - It papers over what looks like a serious bug (CPU hanging? bah, let's >> not bother with that). It is an A15 hanging here by the way, not an A7. > > I missed that part. Are we even sure then that this is related to > running on Big.Little? The log is way to terse to tell. I asked Andre to investigate this in a separate email (I'm away from my TC2 for a while). >> - It forces the user to choose between 5 CPUs and KVM (while simply >> setting thread affinity would solve the problem this patch tries to >> solve). > > But this happens during boot, so thread affinity won't really be a > valid work around for this... Just refuse to initialize KVM on the A7s, display a warning indicating the *valid* CPUs, and let the user use taskset to run their KVM guest. And/or enforce this in the kernel when hitting vcpu initialisation (using sched_setaffinity). >> - It reports potentially wrong information (setting maxcpus= will >> probably >> do the wrong thing if cluster 0 is A7 based). > > ok, fair enough. > >> >> For these reasons, I'm still strongly opposed to this patch being merged. >> Yes, this is a quick way to hide a (much) bigger problem, but in no way a >> fix. >> > > I'm not claiming this to be a fix for the underlying problem, and I > would much rather see a proper fix. But, I also don't want kernels > configured with KVM to cause systems not to boot and if we can prevent > that from happening for now, in whichever rough possible way, I think > that takes priority. I agree with this. I disagree with the method. > Right now all the code is written with the inherent assumption that > we're running on an A15, and while I did not scrutinize if some of our > code can break the host if run on an A7, we should probably make sure > that doesn't happen until we actively support Big.Little and A7. I > think relying on users not to crash their kernels by setting CPU > affinities is also a big mistake. KVM initialization on A7 should really already work as it is. If it doesn't, then it is a bug that is waiting to occur on A15. And the above hang may just be a sign that the problem is actually occurring already. As for the affinity, we can enforce this very easily. M.
On Mon, Apr 15, 2013 at 1:43 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On Mon, 15 Apr 2013 01:28:25 -0700, Christoffer Dall > <cdall@cs.columbia.edu> wrote: >> On Mon, Apr 15, 2013 at 12:50 AM, Marc Zyngier <marc.zyngier@arm.com> >> wrote: >>> On Sun, 14 Apr 2013 21:57:36 -0700, Christoffer Dall >>> <cdall@cs.columbia.edu> wrote: >>>> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> >>> wrote: >>>>> On 12/04/13 14:04, Andre Przywara wrote: >>>>>> kvm_target_cpus() checks the compatibility of the used CPU with >>>>>> KVM, which is currently limited to ARM Cortex-A15 cores. >>>>>> However by calling it only once on any random CPU it assumes that >>>>>> all cores are the same, which is not true for big.LITTLE parts. >>>>>> >>>>>> After doing about 40 boots on a TC-2 core tile, I found it running >>>>>> in all but one case on one of the A7 cores (which correctly denied >>>>>> KVM initialization). On the 39th boot however the code ran on >>>>>> an A15, leading to a hang after returning success: >>>>>> >>>>>> ... >>>>>> TCP: reno registered >>>>>> UDP hash table entries: 512 (order: 2, 16384 bytes) >>>>>> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) >>>>>> NET: Registered protocol family 1 >>>>>> kvm_target_cpu() on CPU #1, part is c0f0 >>>>>> ... (pause for a while) ... >>>>>> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected >>>>>> stalls on CPU >>>>>> s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, >>>>>> c=4294966998, q=15) >>>>>> Task dump for CPU 1: >>>>>> swapper/0 R running 0 1 0 0x00000002 >>>>>> >>>>>> So iterate over every CPU to correctly determine the capability of >>>>>> the system to run the current KVM implementation. >>>>>> In case a big.LITTLE configuration is the reason for denial, give >>>>>> the user a hint how to get it running anyway (maxcpus= on the kernel >>>>>> command line). >>>>>> >>>>>> Please push this still into 3.9. >>>>>> >>>>>> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> >>>>> >>>>> [...] >>>>> >>>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>>> something is severely broken, and you definitely want to fix that >>> issue, >>>>> instead of blindly ignoring it. >>>>> >>>>> Additionally, it seems you're just papering over the issue. You > should >>>>> be able to exclude the A7 processors, but not completely deny KVM > from >>>>> running on the hardware. >>>>> >>>> Marc, I disagree with this nak. If the current kernel breaks boot on a >>>> Big.Little system, we need to take care of that first, and the >>>> proposed patch is a quick way to do so, and it does not stand in the >>>> way of introducing proper Big.Little support in any way, which I'm >>>> sure is going to open up a lot of other interesting questions. >>>> >>>> I'm going to take this one. >>> >>> That's your privilege. >>> >>> Now, my objections about this patch still stand: >>> - It papers over what looks like a serious bug (CPU hanging? bah, let's >>> not bother with that). It is an A15 hanging here by the way, not an A7. >> >> I missed that part. Are we even sure then that this is related to >> running on Big.Little? > > The log is way to terse to tell. I asked Andre to investigate this in a > separate email (I'm away from my TC2 for a while). > >>> - It forces the user to choose between 5 CPUs and KVM (while simply >>> setting thread affinity would solve the problem this patch tries to >>> solve). >> >> But this happens during boot, so thread affinity won't really be a >> valid work around for this... > > Just refuse to initialize KVM on the A7s, display a warning indicating the > *valid* CPUs, and let the user use taskset to run their KVM guest. And/or > enforce this in the kernel when hitting vcpu initialisation (using > sched_setaffinity). > >>> - It reports potentially wrong information (setting maxcpus= will >>> probably >>> do the wrong thing if cluster 0 is A7 based). >> >> ok, fair enough. >> >>> >>> For these reasons, I'm still strongly opposed to this patch being > merged. >>> Yes, this is a quick way to hide a (much) bigger problem, but in no way > a >>> fix. >>> >> >> I'm not claiming this to be a fix for the underlying problem, and I >> would much rather see a proper fix. But, I also don't want kernels >> configured with KVM to cause systems not to boot and if we can prevent >> that from happening for now, in whichever rough possible way, I think >> that takes priority. > > I agree with this. I disagree with the method. > >> Right now all the code is written with the inherent assumption that >> we're running on an A15, and while I did not scrutinize if some of our >> code can break the host if run on an A7, we should probably make sure >> that doesn't happen until we actively support Big.Little and A7. I >> think relying on users not to crash their kernels by setting CPU >> affinities is also a big mistake. > > KVM initialization on A7 should really already work as it is. If it > doesn't, then it is a bug that is waiting to occur on A15. And the above > hang may just be a sign that the problem is actually occurring already. > > As for the affinity, we can enforce this very easily. > OK, but let's not go down this road until we've verified that this is really an issue that needs to be handled by excluding A7s. Andre, can you give us some more details?
On 15 April 2013 09:54, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 1:43 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >> As for the affinity, we can enforce this very easily. >> > OK, but let's not go down this road until we've verified that this is > really an issue that needs to be handled by excluding A7s. Andre, can > you give us some more details? We *must* exclude KVM vcpu threads from ever running on A7s, because the cp15 handling code as it stands does not virtualize various registers (most notably the ID/feature registers) and so the vcpu will bounce between "looks like an A15" and "looks like a weird hybrid between A7 and A15". Assuming we don't want to tackle full big.LITTLE host support right now, I can see two options: * forbid KVM completely on a system with any non-A15s * force KVM vcpus to only ever run on the A15s [Q: does all the hyp mode code have to also run on the A15s or is it A7-safe?] -- PMM
On 04/15/2013 11:14 AM, Peter Maydell wrote: > On 15 April 2013 09:54, Christoffer Dall <cdall@cs.columbia.edu> wrote: >> On Mon, Apr 15, 2013 at 1:43 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> As for the affinity, we can enforce this very easily. >>> >> OK, but let's not go down this road until we've verified that this is >> really an issue that needs to be handled by excluding A7s. Andre, can >> you give us some more details? Sure, I am about to collect some data. > > We *must* exclude KVM vcpu threads from ever running on A7s, > because the cp15 handling code as it stands does not virtualize > various registers (most notably the ID/feature registers) and > so the vcpu will bounce between "looks like an A15" and "looks > like a weird hybrid between A7 and A15". Can't we propagate some least common denominator or safe values for those registers instead of propagating the host ones? Or ignore writes to them? Or are there any real showstoppers which I overlooked? Things like cache architecture properties could just be set to some basic values, right? On x86 propagating too much of host CPU information has proven to be problematic, so for instance cache info is very generic. Things are more complicated here, because CPUID is (guest) userland accessible. Eventually the user/qemu should be able to override this, but as a default and for now we should inject one fixed handcrafted CPU regardless of the actual host CPU. If A7 is easier in this regards, this should be considered, too. > Assuming we don't want to tackle full big.LITTLE host support > right now, I can see two options: > * forbid KVM completely on a system with any non-A15s > * force KVM vcpus to only ever run on the A15s I thought a bit about this. It is not so easy to accomplish in a sane way. The CPU is chosen by the scheduler, actually for the QEMU thread. taskset could be of use here, but the affinity could be changed at any time by userland. I thought about returning -EINVAL for the KVM_RUN ioctl if the CPU is not appropriate, but a) QEMU does not know how to handle this properly and b) kvm_arch_vcpu_load currently returns void, so we would need to change it for all KVM architectures. Not sure if its worth to establish this until we fix bL support properly. > [Q: does > all the hyp mode code have to also run on the A15s or > is it A7-safe?] Good point. It seems like the code hangs in init_hyp_mode(). Will do more debugging in there now. Regards, Andre.
On 15 April 2013 10:39, Andre Przywara <andre.przywara@linaro.org> wrote: > On 04/15/2013 11:14 AM, Peter Maydell wrote: >> We *must* exclude KVM vcpu threads from ever running on A7s, >> because the cp15 handling code as it stands does not virtualize >> various registers (most notably the ID/feature registers) and >> so the vcpu will bounce between "looks like an A15" and "looks >> like a weird hybrid between A7 and A15". > > Can't we propagate some least common denominator or safe values for those > registers instead of propagating the host ones? Or ignore writes to them? Or > are there any real showstoppers which I overlooked? Yes, we could do all these things, and will have to; and we need to analyse all the cp15 registers to determine which we need to virtualize and in which of these ways. This is "tackling full big.LITTLE host support" and it is a big job. We need to do it at some point but for the moment we should just make sure we don't do weird and wrong things on an unsupported host config. -- PMM
On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: > On 15 April 2013 11:39, Andre Przywara <andre.przywara@linaro.org > <mailto:andre.przywara@linaro.org>> wrote: > > > [Q: does > > all the hyp mode code have to also run on the A15s or > > is it A7-safe?] > > Good point. It seems like the code hangs in init_hyp_mode(). > Will do more debugging in there now. > > > I've run on this problem before, while trying to run KVM guests on A7 cores. > > For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the > instruction that updates HSCTLR between the two isbs on __do_hyp_init > (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 then > init_hyp_mode() will not hang on the A7 cluster. Other than that from my > limited testing KVM on A7 works on a usual linux guest. I also tried to > only boot the 3rd A7 core to rule out any racing issues, but still the > same behaviour applies. Could well be the same issue here. I chased it down till CPU 2 goes into HYP mode to do the initialization. I am running with maxcpus=3 (this increases the likelyhood that kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. As the HYP mode exception table is empty except for the HVC trap, it may be looping here. I am trying now to get the PC of the faulty instruction. (on a TC2: 0-1: A15, 2-4: A7) kvm_target_cpu() on CPU #1, part is c0f0 kvm [1]: entering init_hyp_mode on CPU 1 kvm [1]: kvm_mmu_init() returned 0 on CPU 1 kvm [1]: hyp_get_vectors() finished on CPU 1 kvm [1]: allocated stack for HV mode on 5 CPUs, now on CPU 1 kvm [0]: entering cpu_init_hyp_mode(8050fde0) on CPU 0 kvm [0]: c_i_h_m: set vectors finished kvm [0]: c_i_h_m: initialization of local variables finished kvm [0]: c_i_h_m: kvm_call_hyp() finished, returning... kvm [1]: entering cpu_init_hyp_mode(8050fde0) on CPU 1 kvm [1]: c_i_h_m: set vectors finished kvm [1]: c_i_h_m: initialization of local variables finished kvm [1]: c_i_h_m: kvm_call_hyp() finished, returning... kvm [0]: entering cpu_init_hyp_mode(8050fde0) on CPU 2 kvm [0]: c_i_h_m: set vectors finished kvm [0]: c_i_h_m: initialization of local variables finished Regards, Andre.
On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: > On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: > > I've run on this problem before, while trying to run KVM guests on A7 cores. > > > > For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the > > instruction that updates HSCTLR between the two isbs on __do_hyp_init > > (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 then > > init_hyp_mode() will not hang on the A7 cluster. Other than that from my > > limited testing KVM on A7 works on a usual linux guest. I also tried to > > only boot the 3rd A7 core to rule out any racing issues, but still the > > same behaviour applies. > > Could well be the same issue here. I chased it down till CPU 2 goes into > HYP mode to do the initialization. > I am running with maxcpus=3 (this increases the likelyhood that > kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. > As the HYP mode exception table is empty except for the HVC trap, it may > be looping here. I am trying now to get the PC of the faulty instruction. Yes, it sounds like you're taking a recursive fault because the vectors aren't installed yet. Is there any chance you can find out what value you end up writing (or trying to write) to the HSCTLR please? Will
On 04/15/2013 03:48 PM, Will Deacon wrote: > On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >>> I've run on this problem before, while trying to run KVM guests on A7 cores. >>> >>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the >>> instruction that updates HSCTLR between the two isbs on __do_hyp_init >>> (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 then >>> init_hyp_mode() will not hang on the A7 cluster. Other than that from my >>> limited testing KVM on A7 works on a usual linux guest. I also tried to >>> only boot the 3rd A7 core to rule out any racing issues, but still the >>> same behaviour applies. >> >> Could well be the same issue here. I chased it down till CPU 2 goes into >> HYP mode to do the initialization. >> I am running with maxcpus=3 (this increases the likelyhood that >> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >> As the HYP mode exception table is empty except for the HVC trap, it may >> be looping here. I am trying now to get the PC of the faulty instruction. > > Yes, it sounds like you're taking a recursive fault because the vectors > aren't installed yet. Is there any chance you can find out what value you end > up writing (or trying to write) to the HSCTLR please? Mmh, still thinking about a nice way to get some information out of HYP mode. schedule_delayed_work() to monitor a memory address didn't work (too early?). But meanwhile I compared the A7 and A15 TRM 4.3.32 HSCTLR: Bit 21 is fixed 0 on A15, but fixed 1 on A7. Can someone confirm that this is not a typo? Regards, Andre.
On 15 April 2013 15:26, Andre Przywara <andre.przywara@linaro.org> wrote: > But meanwhile I compared the A7 and A15 TRM 4.3.32 HSCTLR: > Bit 21 is fixed 0 on A15, but fixed 1 on A7. > Can someone confirm that this is not a typo? The A7 TRM diagram says '1' but the text says RAZ/WI. The bit should match SCTLR.FI (bit 21), which the A7 TRM definitely says is RAZ/WI. If you have the hardware you can confirm the behaviour yourself by reading the register and seeing what the bit value is :-) -- PMM
On 15 April 2013 16:26, Andre Przywara <andre.przywara@linaro.org> wrote: > Mmh, still thinking about a nice way to get some information out of HYP mode. schedule_delayed_work() to monitor a memory address didn't work (too early?). > > But meanwhile I compared the A7 and A15 TRM 4.3.32 HSCTLR: > Bit 21 is fixed 0 on A15, but fixed 1 on A7. > Can someone confirm that this is not a typo? The HSCTLR value, seemed sane for all cores and it was fixed (read HSCTLR and SCTLR, ORs some bits and updates HSCTLR). If I remember correctly, at the time I was testing it, all cores reported the same value for SCTLR but in the case of the 3rd A7 it would crash as soon as it would try to write on HSCTLR. If you want to continue execution remove one of the isbs, of course the 3rd A7 will immediately hang when you try to run a guest (the rest of the cores will operate normally even when running a guest on them). Regards.
On Sun, 2013-04-14 at 21:57 -0700, Christoffer Dall wrote: > On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > Nak. The fact that one of the CPUs seem to hang is a sure sign that > > something is severely broken, and you definitely want to fix that issue, > > instead of blindly ignoring it. > > > > Additionally, it seems you're just papering over the issue. You should > > be able to exclude the A7 processors, but not completely deny KVM from > > running on the hardware. > > > Marc, I disagree with this nak. If the current kernel breaks boot on a > Big.Little system, we need to take care of that first, and the > proposed patch is a quick way to do so, and it does not stand in the > way of introducing proper Big.Little support in any way, which I'm > sure is going to open up a lot of other interesting questions. > > I'm going to take this one. Since this problem will cause the 3.9 kernel to hang then a workaround like this should go in. There isn't enough time to do a proper fix for 3.9, and even if it could be done I think it would be too intrusive to get merged this late. -Geoff
On Mon, Apr 15, 2013 at 2:52 AM, Alexander Spyridakis <a.spyridakis@virtualopensystems.com> wrote: > On 15 April 2013 11:39, Andre Przywara <andre.przywara@linaro.org> wrote: >> >> > [Q: does >> > all the hyp mode code have to also run on the A15s or >> > is it A7-safe?] >> >> Good point. It seems like the code hangs in init_hyp_mode(). >> Will do more debugging in there now. > > > I've run on this problem before, while trying to run KVM guests on A7 cores. > > For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the instruction > that updates HSCTLR between the two isbs on __do_hyp_init (mcr p15, 4, r0, > c1, c0, 0). If you boot the system with maxcpus=4 then init_hyp_mode() will > not hang on the A7 cluster. Other than that from my limited testing KVM on > A7 works on a usual linux guest. I also tried to only boot the 3rd A7 core > to rule out any racing issues, but still the same behaviour applies. > You lost me on the maxcpus=4 and racing point here. If it works with maxcpus=4 then it works with two A7s, right? Why would running with maxcpus=3 (ie. one A7) rule out any race condition? -Christoffer
On Tue, Apr 16, 2013 at 8:59 AM, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 2:52 AM, Alexander Spyridakis > <a.spyridakis@virtualopensystems.com> wrote: >> On 15 April 2013 11:39, Andre Przywara <andre.przywara@linaro.org> wrote: >>> >>> > [Q: does >>> > all the hyp mode code have to also run on the A15s or >>> > is it A7-safe?] >>> >>> Good point. It seems like the code hangs in init_hyp_mode(). >>> Will do more debugging in there now. >> >> >> I've run on this problem before, while trying to run KVM guests on A7 cores. >> >> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the instruction >> that updates HSCTLR between the two isbs on __do_hyp_init (mcr p15, 4, r0, >> c1, c0, 0). If you boot the system with maxcpus=4 then init_hyp_mode() will >> not hang on the A7 cluster. Other than that from my limited testing KVM on >> A7 works on a usual linux guest. I also tried to only boot the 3rd A7 core >> to rule out any racing issues, but still the same behaviour applies. >> > You lost me on the maxcpus=4 and racing point here. If it works with > maxcpus=4 then it works with two A7s, right? Why would running with > maxcpus=3 (ie. one A7) rule out any race condition? > [resending this as someone misspelled Marc's e-mail]
On Mon, Apr 15, 2013 at 5:26 PM, Geoff Levand <geoff@infradead.org> wrote: > On Sun, 2013-04-14 at 21:57 -0700, Christoffer Dall wrote: >> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >> Nak. The fact that one of the CPUs seem to hang is a sure sign that >> > something is severely broken, and you definitely want to fix that issue, >> > instead of blindly ignoring it. >> > >> > Additionally, it seems you're just papering over the issue. You should >> > be able to exclude the A7 processors, but not completely deny KVM from >> > running on the hardware. >> > >> Marc, I disagree with this nak. If the current kernel breaks boot on a >> Big.Little system, we need to take care of that first, and the >> proposed patch is a quick way to do so, and it does not stand in the >> way of introducing proper Big.Little support in any way, which I'm >> sure is going to open up a lot of other interesting questions. >> >> I'm going to take this one. > > Since this problem will cause the 3.9 kernel to hang then a workaround > like this should go in. There isn't enough time to do a proper fix for > 3.9, and even if it could be done I think it would be too intrusive to > get merged this late. > That's why I was inclined to take the patch, but as Marc pointed out the error message is incorrect, so that should be fixed at the very least. Also I don't think we need the counting logic, just bail out if we have any CPUs that are not supported. Marc, since you're the strongest opponent of this patch, are you still opposed to making sure we don't try to run KVM on Big.Little until support is properly introduced? I also cannot see how we can fix the affinity issue easily from within the kernel, do you have a concrete approach in mind you can share? Note that this is irrespective of the boot delay issue that you guys are observing. -Christoffer
On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon <will.deacon@arm.com> wrote: > On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >> > I've run on this problem before, while trying to run KVM guests on A7 cores. >> > >> > For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the >> > instruction that updates HSCTLR between the two isbs on __do_hyp_init >> > (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 then >> > init_hyp_mode() will not hang on the A7 cluster. Other than that from my >> > limited testing KVM on A7 works on a usual linux guest. I also tried to >> > only boot the 3rd A7 core to rule out any racing issues, but still the >> > same behaviour applies. >> >> Could well be the same issue here. I chased it down till CPU 2 goes into >> HYP mode to do the initialization. >> I am running with maxcpus=3 (this increases the likelyhood that >> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >> As the HYP mode exception table is empty except for the HVC trap, it may >> be looping here. I am trying now to get the PC of the faulty instruction. > > Yes, it sounds like you're taking a recursive fault because the vectors > aren't installed yet. Is there any chance you can find out what value you end > up writing (or trying to write) to the HSCTLR please? > Actually I'm a little confused, wasn't Andre seeing a halt on an A15 cpu, not an A7? Or is the theory that an A7 locks up and the calling A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the A7 to complete? -Christoffer
On Tue, 16 Apr 2013 09:26:26 -0700, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon <will.deacon@arm.com> wrote: >> On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >>> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >>> > I've run on this problem before, while trying to run KVM guests on A7 >>> > cores. >>> > >>> > For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the >>> > instruction that updates HSCTLR between the two isbs on __do_hyp_init >>> > (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 >>> > then >>> > init_hyp_mode() will not hang on the A7 cluster. Other than that from >>> > my >>> > limited testing KVM on A7 works on a usual linux guest. I also tried >>> > to >>> > only boot the 3rd A7 core to rule out any racing issues, but still the >>> > same behaviour applies. >>> >>> Could well be the same issue here. I chased it down till CPU 2 goes into >>> HYP mode to do the initialization. >>> I am running with maxcpus=3 (this increases the likelyhood that >>> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >>> As the HYP mode exception table is empty except for the HVC trap, it may >>> be looping here. I am trying now to get the PC of the faulty >>> instruction. >> >> Yes, it sounds like you're taking a recursive fault because the vectors >> aren't installed yet. Is there any chance you can find out what value >> you end >> up writing (or trying to write) to the HSCTLR please? >> > Actually I'm a little confused, wasn't Andre seeing a halt on an A15 > cpu, not an A7? Or is the theory that an A7 locks up and the calling > A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the A7 to > complete? Yes, A15 hanging, not A7. That's why I'm strongly opposed to this patch. I'm pretty sure the A7s only have a side effect that triggers a kernel bug on the A15 side. Before taking *any* patch around this, we should understand the issue fully, and not start patching random stuff just because Linus is going to tag 3.9. M.
On Tue, 16 Apr 2013 09:24:03 -0700, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 5:26 PM, Geoff Levand <geoff@infradead.org> wrote: >> On Sun, 2013-04-14 at 21:57 -0700, Christoffer Dall wrote: >>> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> >>> wrote: >>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>> > something is severely broken, and you definitely want to fix that >>> > issue, >>> > instead of blindly ignoring it. >>> > >>> > Additionally, it seems you're just papering over the issue. You should >>> > be able to exclude the A7 processors, but not completely deny KVM from >>> > running on the hardware. >>> > >>> Marc, I disagree with this nak. If the current kernel breaks boot on a >>> Big.Little system, we need to take care of that first, and the >>> proposed patch is a quick way to do so, and it does not stand in the >>> way of introducing proper Big.Little support in any way, which I'm >>> sure is going to open up a lot of other interesting questions. >>> >>> I'm going to take this one. >> >> Since this problem will cause the 3.9 kernel to hang then a workaround >> like this should go in. There isn't enough time to do a proper fix for >> 3.9, and even if it could be done I think it would be too intrusive to >> get merged this late. >> > That's why I was inclined to take the patch, but as Marc pointed out > the error message is incorrect, so that should be fixed at the very > least. Also I don't think we need the counting logic, just bail out if > we have any CPUs that are not supported. > > Marc, since you're the strongest opponent of this patch, are you still > opposed to making sure we don't try to run KVM on Big.Little until > support is properly introduced? Nothing I've seen so far proves that BL isn't working. Yes, we know the guest side is going to break. But the host should be solid, and if it isn't, let's fix it. What we've seen so far is an A15 hanging (Andre's trace), and Alexander's weird problem with the third A7 hanging. So far, I'd be inclined to say that BL is working well enough for that bug to be reproduced on both sides. > I also cannot see how we can fix the affinity issue easily from within > the kernel, do you have a concrete approach in mind you can share? My idea was to check the affinity of the vcpu thread, compare it to the "KVM affinity", and force it to the intersection of these two sets (rescheduling if not on an A15). If the intersection is null, just give up. Not pretty, but ensures nothing gets scheduled on an A7. M.
On 16 April 2013 17:59, Christoffer Dall <cdall@cs.columbia.edu> wrote: > On Mon, Apr 15, 2013 at 2:52 AM, Alexander Spyridakis > <a.spyridakis@virtualopensystems.com> wrote: >> I've run on this problem before, while trying to run KVM guests on A7 cores. >> >> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the instruction >> that updates HSCTLR between the two isbs on __do_hyp_init (mcr p15, 4, r0, >> c1, c0, 0). If you boot the system with maxcpus=4 then init_hyp_mode() will >> not hang on the A7 cluster. Other than that from my limited testing KVM on >> A7 works on a usual linux guest. I also tried to only boot the 3rd A7 core >> to rule out any racing issues, but still the same behaviour applies. >> > You lost me on the maxcpus=4 and racing point here. If it works with > maxcpus=4 then it works with two A7s, right? Why would running with > maxcpus=3 (ie. one A7) rule out any race condition? maxcpus=3 would still work (only one A7), but I didn't test it that way. To make sure and avoid any race conditions from other cpus, I changed the bootwrapper to explicitly boot only the 3rd A7 (rest of the cpus being in an infinite loop) and it would still hang. Booting only the 1st or 2nd A7 core by the same method would work as expected (no hang in __do_hyp_init_). Regards.
[sorry, resending this as my client still insists on misspelling Marc's mail] On 16 April 2013 20:37, Alexander Spyridakis <a.spyridakis@virtualopensystems.com> wrote: > On 16 April 2013 17:59, Christoffer Dall <cdall@cs.columbia.edu> wrote: >> On Mon, Apr 15, 2013 at 2:52 AM, Alexander Spyridakis >> <a.spyridakis@virtualopensystems.com> wrote: >>> I've run on this problem before, while trying to run KVM guests on A7 cores. >>> >>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the instruction >>> that updates HSCTLR between the two isbs on __do_hyp_init (mcr p15, 4, r0, >>> c1, c0, 0). If you boot the system with maxcpus=4 then init_hyp_mode() will >>> not hang on the A7 cluster. Other than that from my limited testing KVM on >>> A7 works on a usual linux guest. I also tried to only boot the 3rd A7 core >>> to rule out any racing issues, but still the same behaviour applies. >>> >> You lost me on the maxcpus=4 and racing point here. If it works with >> maxcpus=4 then it works with two A7s, right? Why would running with >> maxcpus=3 (ie. one A7) rule out any race condition? > > maxcpus=3 would still work (only one A7), but I didn't test it that > way. To make sure and avoid any race conditions from other cpus, I > changed the bootwrapper to explicitly boot only the 3rd A7 (rest of > the cpus being in an infinite loop) and it would still hang. Booting > only the 1st or 2nd A7 core by the same method would work as expected > (no hang in __do_hyp_init_). > > Regards.
On Tue, Apr 16, 2013 at 11:37 AM, Alexander Spyridakis <a.spyridakis@virtualopensystems.com> wrote: > On 16 April 2013 17:59, Christoffer Dall <cdall@cs.columbia.edu> wrote: >> On Mon, Apr 15, 2013 at 2:52 AM, Alexander Spyridakis >> <a.spyridakis@virtualopensystems.com> wrote: >>> I've run on this problem before, while trying to run KVM guests on A7 cores. >>> >>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the instruction >>> that updates HSCTLR between the two isbs on __do_hyp_init (mcr p15, 4, r0, >>> c1, c0, 0). If you boot the system with maxcpus=4 then init_hyp_mode() will >>> not hang on the A7 cluster. Other than that from my limited testing KVM on >>> A7 works on a usual linux guest. I also tried to only boot the 3rd A7 core >>> to rule out any racing issues, but still the same behaviour applies. >>> >> You lost me on the maxcpus=4 and racing point here. If it works with >> maxcpus=4 then it works with two A7s, right? Why would running with >> maxcpus=3 (ie. one A7) rule out any race condition? > > maxcpus=3 would still work (only one A7), but I didn't test it that > way. To make sure and avoid any race conditions from other cpus, I > changed the bootwrapper to explicitly boot only the 3rd A7 (rest of > the cpus being in an infinite loop) and it would still hang. Booting > only the 1st or 2nd A7 core by the same method would work as expected > (no hang in __do_hyp_init_). > so work as expected, as in it doesn't work... Hence the confusion. I don't think this is racy in any way, but for the record, you still have a couple of other A15s running here, so you can't completely rule out races based on your experimental approach, but I really don't think that's the issue anyhow.
On 04/16/2013 06:24 PM, Christoffer Dall wrote: > On Mon, Apr 15, 2013 at 5:26 PM, Geoff Levand <geoff@infradead.org> wrote: >> On Sun, 2013-04-14 at 21:57 -0700, Christoffer Dall wrote: >>> On Fri, Apr 12, 2013 at 6:24 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>> something is severely broken, and you definitely want to fix that issue, >>>> instead of blindly ignoring it. >>>> >>>> Additionally, it seems you're just papering over the issue. You should >>>> be able to exclude the A7 processors, but not completely deny KVM from >>>> running on the hardware. >>>> >>> Marc, I disagree with this nak. If the current kernel breaks boot on a >>> Big.Little system, we need to take care of that first, and the >>> proposed patch is a quick way to do so, and it does not stand in the >>> way of introducing proper Big.Little support in any way, which I'm >>> sure is going to open up a lot of other interesting questions. >>> >>> I'm going to take this one. >> >> Since this problem will cause the 3.9 kernel to hang then a workaround >> like this should go in. There isn't enough time to do a proper fix for >> 3.9, and even if it could be done I think it would be too intrusive to >> get merged this late. >> > That's why I was inclined to take the patch, but as Marc pointed out > the error message is incorrect, so that should be fixed at the very > least. Also I don't think we need the counting logic, just bail out if > we have any CPUs that are not supported. The counting logic is just to prevent an erroneous hint message. If Cluster 0 is A7, the first CPU checked will fail, the counter is 0 and KVM outputs the "Target CPU not supported!" message. But if at least one CPU already passed the test, we have a) a b.L system and b) restricting the number of CPUs with maxcpus= would help. Thus the hint at this point. But I am OK with removing both the hint message and the counting altogether, if you want to have an easier patch. Regards, Andre. > Marc, since you're the strongest opponent of this patch, are you still > opposed to making sure we don't try to run KVM on Big.Little until > support is properly introduced? > > I also cannot see how we can fix the affinity issue easily from within > the kernel, do you have a concrete approach in mind you can share? > > Note that this is irrespective of the boot delay issue that you guys > are observing. > > -Christoffer > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm >
On 04/16/2013 06:33 PM, Marc Zyngier wrote: > On Tue, 16 Apr 2013 09:26:26 -0700, Christoffer Dall > <cdall@cs.columbia.edu> wrote: >> On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon <will.deacon@arm.com> > wrote: >>> On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >>>> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >>>>> I've run on this problem before, while trying to run KVM guests on > A7 >>>>> cores. >>>>> >>>>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the >>>>> instruction that updates HSCTLR between the two isbs on > __do_hyp_init >>>>> (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 >>>>> then >>>>> init_hyp_mode() will not hang on the A7 cluster. Other than that > from >>>>> my >>>>> limited testing KVM on A7 works on a usual linux guest. I also tried >>>>> to >>>>> only boot the 3rd A7 core to rule out any racing issues, but still > the >>>>> same behaviour applies. >>>> >>>> Could well be the same issue here. I chased it down till CPU 2 goes > into >>>> HYP mode to do the initialization. >>>> I am running with maxcpus=3 (this increases the likelyhood that >>>> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >>>> As the HYP mode exception table is empty except for the HVC trap, it > may >>>> be looping here. I am trying now to get the PC of the faulty >>>> instruction. >>> >>> Yes, it sounds like you're taking a recursive fault because the vectors >>> aren't installed yet. Is there any chance you can find out what value >>> you end >>> up writing (or trying to write) to the HSCTLR please? >>> >> Actually I'm a little confused, wasn't Andre seeing a halt on an A15 >> cpu, not an A7? Or is the theory that an A7 locks up and the calling >> A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the A7 to >> complete? > > Yes, A15 hanging, not A7. That's why I'm strongly opposed to this patch. > I'm pretty sure the A7s only have a side effect that triggers a kernel bug > on the A15 side. Before taking *any* patch around this, we should > understand the issue fully, and not start patching random stuff just > because Linus is going to tag 3.9. I think there is a misunderstanding. The RCU watchdog was complaining because the A15 wasn't making any progress. As Christoffer said, this is because it was waiting for CPU 2 to return from the SMP call. It is actually the A7 hanging inside HYP mode. I tried some ways to get information out of there, but had no luck so far. The different mapping between HYP and SVC doesn't make it easy to dump some variables, but I am still working on it (but only half steam because I am home looking after my sick daughter). So for now I assume that it is the HSCTLR setting Alexander observed already. Regards, Andre.
On Wed, 17 Apr 2013 10:08:12 +0200, Andre Przywara <andre.przywara@linaro.org> wrote: > On 04/16/2013 06:33 PM, Marc Zyngier wrote: >> On Tue, 16 Apr 2013 09:26:26 -0700, Christoffer Dall >> <cdall@cs.columbia.edu> wrote: >>> On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon <will.deacon@arm.com> >> wrote: >>>> On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >>>>> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >>>>>> I've run on this problem before, while trying to run KVM guests on >> A7 >>>>>> cores. >>>>>> >>>>>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the >>>>>> instruction that updates HSCTLR between the two isbs on >> __do_hyp_init >>>>>> (mcr p15, 4, r0, c1, c0, 0). If you boot the system with maxcpus=4 >>>>>> then >>>>>> init_hyp_mode() will not hang on the A7 cluster. Other than that >> from >>>>>> my >>>>>> limited testing KVM on A7 works on a usual linux guest. I also tried >>>>>> to >>>>>> only boot the 3rd A7 core to rule out any racing issues, but still >> the >>>>>> same behaviour applies. >>>>> >>>>> Could well be the same issue here. I chased it down till CPU 2 goes >> into >>>>> HYP mode to do the initialization. >>>>> I am running with maxcpus=3 (this increases the likelyhood that >>>>> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >>>>> As the HYP mode exception table is empty except for the HVC trap, it >> may >>>>> be looping here. I am trying now to get the PC of the faulty >>>>> instruction. >>>> >>>> Yes, it sounds like you're taking a recursive fault because the vectors >>>> aren't installed yet. Is there any chance you can find out what value >>>> you end >>>> up writing (or trying to write) to the HSCTLR please? >>>> >>> Actually I'm a little confused, wasn't Andre seeing a halt on an A15 >>> cpu, not an A7? Or is the theory that an A7 locks up and the calling >>> A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the A7 to >>> complete? >> >> Yes, A15 hanging, not A7. That's why I'm strongly opposed to this patch. >> I'm pretty sure the A7s only have a side effect that triggers a kernel >> bug >> on the A15 side. Before taking *any* patch around this, we should >> understand the issue fully, and not start patching random stuff just >> because Linus is going to tag 3.9. > > I think there is a misunderstanding. The RCU watchdog was complaining > because the A15 wasn't making any progress. As Christoffer said, this is > because it was waiting for CPU 2 to return from the SMP call. It is > actually the A7 hanging inside HYP mode. > I tried some ways to get information out of there, but had no luck so > far. The different mapping between HYP and SVC doesn't make it easy to > dump some variables, but I am still working on it (but only half steam You could force a full mapping of the kernel text in HYP. Ugly, but should work. > because I am home looking after my sick daughter). So for now I assume > that it is the HSCTLR setting Alexander observed already. I'll give it a go today or tomorrow, depending how quickly I can get rid of my backlog after a couple of days off work. Assuming this is an A7 handing on HSCTLR access, it should be pretty easy to narrow down by booting only on the A7s, leaving the A15s held in reset. M.
On Fri, Apr 12, 2013 at 02:49:43PM +0100, Marc Zyngier wrote: > On 12/04/13 14:40, Peter Maydell wrote: > > On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: > >> Nak. The fact that one of the CPUs seem to hang is a sure sign that > >> something is severely broken, and you definitely want to fix that issue, > >> instead of blindly ignoring it. > >> > >> Additionally, it seems you're just papering over the issue. You should > >> be able to exclude the A7 processors, but not completely deny KVM from > >> running on the hardware. > > > > Well that might be nice, as would fully supporting big.LITTLE > > systems. But until somebody actually does that work it seems > > like a better idea to fail gracefully rather than having a 50% > > chance of failing gracefully and a 50% chance of going weird. > > Nothing prevents the kernel (or even the user) from forcing the affinity > of the CPU threads to the A15s. I'm not saying we should ignore the > problem either. Just that the proposed approach is wrong. But nothing guarantees that you get that affinity. If you offline all A15 CPUs, then you will find those threads running on whatever is left. Affinity is just a hint, nothing more.
On 17/04/13 11:19, Russell King - ARM Linux wrote: > On Fri, Apr 12, 2013 at 02:49:43PM +0100, Marc Zyngier wrote: >> On 12/04/13 14:40, Peter Maydell wrote: >>> On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>> something is severely broken, and you definitely want to fix that issue, >>>> instead of blindly ignoring it. >>>> >>>> Additionally, it seems you're just papering over the issue. You should >>>> be able to exclude the A7 processors, but not completely deny KVM from >>>> running on the hardware. >>> >>> Well that might be nice, as would fully supporting big.LITTLE >>> systems. But until somebody actually does that work it seems >>> like a better idea to fail gracefully rather than having a 50% >>> chance of failing gracefully and a 50% chance of going weird. >> >> Nothing prevents the kernel (or even the user) from forcing the affinity >> of the CPU threads to the A15s. I'm not saying we should ignore the >> problem either. Just that the proposed approach is wrong. > > But nothing guarantees that you get that affinity. If you offline all > A15 CPUs, then you will find those threads running on whatever is left. > Affinity is just a hint, nothing more. I completely agree with you. But if we're left with only CPUs we can't run on, we're screwed and must abort. It's the same story as the RealView PB-X, where only one of the two A9 has NEON. If the NEON-capable core is down, any process using NEON is virtually dead. Should that be a reason to completely disable the HW (in this case the 3 A7s)? I'm not sure... M.
On Wed, Apr 17, 2013 at 3:35 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 17/04/13 11:19, Russell King - ARM Linux wrote: >> On Fri, Apr 12, 2013 at 02:49:43PM +0100, Marc Zyngier wrote: >>> On 12/04/13 14:40, Peter Maydell wrote: >>>> On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>>> something is severely broken, and you definitely want to fix that issue, >>>>> instead of blindly ignoring it. >>>>> >>>>> Additionally, it seems you're just papering over the issue. You should >>>>> be able to exclude the A7 processors, but not completely deny KVM from >>>>> running on the hardware. >>>> >>>> Well that might be nice, as would fully supporting big.LITTLE >>>> systems. But until somebody actually does that work it seems >>>> like a better idea to fail gracefully rather than having a 50% >>>> chance of failing gracefully and a 50% chance of going weird. >>> >>> Nothing prevents the kernel (or even the user) from forcing the affinity >>> of the CPU threads to the A15s. I'm not saying we should ignore the >>> problem either. Just that the proposed approach is wrong. >> >> But nothing guarantees that you get that affinity. If you offline all >> A15 CPUs, then you will find those threads running on whatever is left. >> Affinity is just a hint, nothing more. > > I completely agree with you. But if we're left with only CPUs we can't > run on, we're screwed and must abort. > > It's the same story as the RealView PB-X, where only one of the two A9 > has NEON. If the NEON-capable core is down, any process using NEON is > virtually dead. > > Should that be a reason to completely disable the HW (in this case the 3 > A7s)? I'm not sure... > But we're not talking about disabling the A7's, we're talking about disabling KVM/ARM, a quite new feature, on a system where it's not well-tested and may cause boot problems or other issues that we haven't investigated in depth yet.
On 17/04/13 12:07, Christoffer Dall wrote: > On Wed, Apr 17, 2013 at 3:35 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On 17/04/13 11:19, Russell King - ARM Linux wrote: >>> On Fri, Apr 12, 2013 at 02:49:43PM +0100, Marc Zyngier wrote: >>>> On 12/04/13 14:40, Peter Maydell wrote: >>>>> On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>>>> something is severely broken, and you definitely want to fix that issue, >>>>>> instead of blindly ignoring it. >>>>>> >>>>>> Additionally, it seems you're just papering over the issue. You should >>>>>> be able to exclude the A7 processors, but not completely deny KVM from >>>>>> running on the hardware. >>>>> >>>>> Well that might be nice, as would fully supporting big.LITTLE >>>>> systems. But until somebody actually does that work it seems >>>>> like a better idea to fail gracefully rather than having a 50% >>>>> chance of failing gracefully and a 50% chance of going weird. >>>> >>>> Nothing prevents the kernel (or even the user) from forcing the affinity >>>> of the CPU threads to the A15s. I'm not saying we should ignore the >>>> problem either. Just that the proposed approach is wrong. >>> >>> But nothing guarantees that you get that affinity. If you offline all >>> A15 CPUs, then you will find those threads running on whatever is left. >>> Affinity is just a hint, nothing more. >> >> I completely agree with you. But if we're left with only CPUs we can't >> run on, we're screwed and must abort. >> >> It's the same story as the RealView PB-X, where only one of the two A9 >> has NEON. If the NEON-capable core is down, any process using NEON is >> virtually dead. >> >> Should that be a reason to completely disable the HW (in this case the 3 >> A7s)? I'm not sure... >> > But we're not talking about disabling the A7's, we're talking about > disabling KVM/ARM, a quite new feature, on a system where it's not > well-tested and may cause boot problems or other issues that we > haven't investigated in depth yet. Well, it's a choice between disabling KVM or the A7s. Cheese or dessert? Black death or cholera? And I go back to my earlier argument: we don't know what's going wrong. It could be a bug in our init code, it could be the bootloader that fails to correctly initialize the A7s in HYP mode, it could be something else. By disabling the problematic configuration, we just bury our head in the sand and pretend everything is fine. Yes, this is a new feature. But by saying "not a supported configuration", while the code *should* support it, we're only fooling ourselves. M.
On 17 April 2013 12:30, Marc Zyngier <marc.zyngier@arm.com> wrote: > By disabling the problematic configuration, we just bury our head in the > sand and pretend everything is fine. Yes, this is a new feature. But by > saying "not a supported configuration", while the code *should* support > it, we're only fooling ourselves. Who says the code should support running on big.LITTLE? I think it's very clear that the only thing it should support at the moment is A15 guest CPUs on an all-A15 host CPU system. -- PMM
On Wed, Apr 17, 2013 at 4:30 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 17/04/13 12:07, Christoffer Dall wrote: >> On Wed, Apr 17, 2013 at 3:35 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> On 17/04/13 11:19, Russell King - ARM Linux wrote: >>>> On Fri, Apr 12, 2013 at 02:49:43PM +0100, Marc Zyngier wrote: >>>>> On 12/04/13 14:40, Peter Maydell wrote: >>>>>> On 12 April 2013 14:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>>>>> Nak. The fact that one of the CPUs seem to hang is a sure sign that >>>>>>> something is severely broken, and you definitely want to fix that issue, >>>>>>> instead of blindly ignoring it. >>>>>>> >>>>>>> Additionally, it seems you're just papering over the issue. You should >>>>>>> be able to exclude the A7 processors, but not completely deny KVM from >>>>>>> running on the hardware. >>>>>> >>>>>> Well that might be nice, as would fully supporting big.LITTLE >>>>>> systems. But until somebody actually does that work it seems >>>>>> like a better idea to fail gracefully rather than having a 50% >>>>>> chance of failing gracefully and a 50% chance of going weird. >>>>> >>>>> Nothing prevents the kernel (or even the user) from forcing the affinity >>>>> of the CPU threads to the A15s. I'm not saying we should ignore the >>>>> problem either. Just that the proposed approach is wrong. >>>> >>>> But nothing guarantees that you get that affinity. If you offline all >>>> A15 CPUs, then you will find those threads running on whatever is left. >>>> Affinity is just a hint, nothing more. >>> >>> I completely agree with you. But if we're left with only CPUs we can't >>> run on, we're screwed and must abort. >>> >>> It's the same story as the RealView PB-X, where only one of the two A9 >>> has NEON. If the NEON-capable core is down, any process using NEON is >>> virtually dead. >>> >>> Should that be a reason to completely disable the HW (in this case the 3 >>> A7s)? I'm not sure... >>> >> But we're not talking about disabling the A7's, we're talking about >> disabling KVM/ARM, a quite new feature, on a system where it's not >> well-tested and may cause boot problems or other issues that we >> haven't investigated in depth yet. > > Well, it's a choice between disabling KVM or the A7s. Cheese or dessert? > Black death or cholera? > > And I go back to my earlier argument: we don't know what's going wrong. > It could be a bug in our init code, it could be the bootloader that > fails to correctly initialize the A7s in HYP mode, it could be something > else. > > By disabling the problematic configuration, we just bury our head in the > sand and pretend everything is fine. Yes, this is a new feature. But by > saying "not a supported configuration", while the code *should* support > it, we're only fooling ourselves. > I agree completely that we shouldn't hide some unknown bug by taking a wild guess, I'm obviously not arguing for this. However, we didn't verify that things should work on Big.Little, and we don't have any code to support A7's. We've written the code specifically with the idea that we emulate the underlying hardware and we expect that the target_cpu function will return something valid, which it does not for an A7, so no, it there is no expectation that things should work on Big.Little or an A7. Again, I stress that this is an orthogonal point to the specific issue that Andre/Alex are seeing on Big.Little.
On 17.04.2013, at 13:38, Peter Maydell wrote: > On 17 April 2013 12:30, Marc Zyngier <marc.zyngier@arm.com> wrote: >> By disabling the problematic configuration, we just bury our head in the >> sand and pretend everything is fine. Yes, this is a new feature. But by >> saying "not a supported configuration", while the code *should* support >> it, we're only fooling ourselves. > > Who says the code should support running on big.LITTLE? I think > it's very clear that the only thing it should support at the > moment is A15 guest CPUs on an all-A15 host CPU system. Is anyone going to run this code on big.LITTLE systems before 3.10 is out? Just fix it for real with 3.10 and everyone's happy. If need be, you can always CC the fixes to stable to have them in 3.9. Alex
On Wed, Apr 17, 2013 at 4:42 AM, Alexander Graf <agraf@suse.de> wrote: > > On 17.04.2013, at 13:38, Peter Maydell wrote: > >> On 17 April 2013 12:30, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> By disabling the problematic configuration, we just bury our head in the >>> sand and pretend everything is fine. Yes, this is a new feature. But by >>> saying "not a supported configuration", while the code *should* support >>> it, we're only fooling ourselves. >> >> Who says the code should support running on big.LITTLE? I think >> it's very clear that the only thing it should support at the >> moment is A15 guest CPUs on an all-A15 host CPU system. > > Is anyone going to run this code on big.LITTLE systems before 3.10 is out? > Apparently yes, otherwise we would not be having this discussion in the first place. > Just fix it for real with 3.10 and everyone's happy. If need be, you can always CC the fixes to stable to have them in 3.9. > > That's really an added feature set, which I don't think is appropriate for stable. Currently we *do not support big.little* but we pretend that we do, and even worse, we pretend that we can emulate an A15 on an A7 for example, which we obviously cannot, and nobody looked at what kind of damage that can cause. Therefore, the "fix" for the current supported features is to bail out on something as funky as a big.little, and I honestly don't see what the big problem with that is. And, such a fix is what should go in stable.
On 17/04/13 12:38, Peter Maydell wrote: > On 17 April 2013 12:30, Marc Zyngier <marc.zyngier@arm.com> wrote: >> By disabling the problematic configuration, we just bury our head in the >> sand and pretend everything is fine. Yes, this is a new feature. But by >> saying "not a supported configuration", while the code *should* support >> it, we're only fooling ourselves. > > Who says the code should support running on big.LITTLE? I do. The host code definitely should. If it doesn't, then lets fix that bug, as it probably means the support is also broken on A15. As I said in another email, guest support is a separate issue. M.
On 17 April 2013 13:24, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 17/04/13 12:38, Peter Maydell wrote >> Who says the code should support running on big.LITTLE? > > I do. The host code definitely should. If it doesn't, then lets fix that > bug, as it probably means the support is also broken on A15. Last time I looked we weren't intercepting ID and feature registers. That's broken if the host isn't A15. -- PMM
On 17/04/13 13:25, Peter Maydell wrote: > On 17 April 2013 13:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On 17/04/13 12:38, Peter Maydell wrote >>> Who says the code should support running on big.LITTLE? >> >> I do. The host code definitely should. If it doesn't, then lets fix that >> bug, as it probably means the support is also broken on A15. > > Last time I looked we weren't intercepting ID and feature > registers. That's broken if the host isn't A15. Would you please look at the sentence that follows the one you've quoted? It says "guest support is a separate issue". As in "orthogonal problem". M.
On 17 April 2013 13:28, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 17/04/13 13:25, Peter Maydell wrote: >> On 17 April 2013 13:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> On 17/04/13 12:38, Peter Maydell wrote >>>> Who says the code should support running on big.LITTLE? >>> >>> I do. The host code definitely should. If it doesn't, then lets fix that >>> bug, as it probably means the support is also broken on A15. >> >> Last time I looked we weren't intercepting ID and feature >> registers. That's broken if the host isn't A15. > > Would you please look at the sentence that follows the one you've > quoted? It says "guest support is a separate issue". As in "orthogonal > problem". I'm confused now, because I thought the whole point of this patch was that we don't support guests on big.LITTLE but we're not locking them out (the way we would for an all-A7 system). That is, we're fixing a bug in the check code. Obviously it seems worth investigating exactly what is happening on the A7s when we try to run the init code, but it seems to me like that's the orthogonal issue, because we shouldn't be running that code in the first place. -- PMM
On 17/04/13 13:38, Peter Maydell wrote: > On 17 April 2013 13:28, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On 17/04/13 13:25, Peter Maydell wrote: >>> On 17 April 2013 13:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>> On 17/04/13 12:38, Peter Maydell wrote >>>>> Who says the code should support running on big.LITTLE? >>>> >>>> I do. The host code definitely should. If it doesn't, then lets fix that >>>> bug, as it probably means the support is also broken on A15. >>> >>> Last time I looked we weren't intercepting ID and feature >>> registers. That's broken if the host isn't A15. >> >> Would you please look at the sentence that follows the one you've >> quoted? It says "guest support is a separate issue". As in "orthogonal >> problem". > > I'm confused now, because I thought the whole point of this > patch was that we don't support guests on big.LITTLE but > we're not locking them out (the way we would for an all-A7 > system). That is, we're fixing a bug in the check code. > > Obviously it seems worth investigating exactly what is > happening on the A7s when we try to run the init code, > but it seems to me like that's the orthogonal issue, > because we shouldn't be running that code in the first > place. Turns out that init code was designed to run unchanged on both A15 and A7. If that code isn't properly working on a HYP-aware CPU, then we're in trouble. I don't give a damn about full BL support in KVM at that stage. All I want is a clear description of *why* that code is failing, and possibly a way to run guests on the A15 side of a BL platform, without having to give away half of the cores. I think I've already spent too much time arguing about this. Hopefully someone will be interested enough in BL to revert this patch and fix the code (whether it is in the kernel or bootloader). M.
On Wed, Apr 17, 2013 at 6:00 AM, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 17/04/13 13:38, Peter Maydell wrote: >> On 17 April 2013 13:28, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> On 17/04/13 13:25, Peter Maydell wrote: >>>> On 17 April 2013 13:24, Marc Zyngier <marc.zyngier@arm.com> wrote: >>>>> On 17/04/13 12:38, Peter Maydell wrote >>>>>> Who says the code should support running on big.LITTLE? >>>>> >>>>> I do. The host code definitely should. If it doesn't, then lets fix that >>>>> bug, as it probably means the support is also broken on A15. >>>> >>>> Last time I looked we weren't intercepting ID and feature >>>> registers. That's broken if the host isn't A15. >>> >>> Would you please look at the sentence that follows the one you've >>> quoted? It says "guest support is a separate issue". As in "orthogonal >>> problem". >> >> I'm confused now, because I thought the whole point of this >> patch was that we don't support guests on big.LITTLE but >> we're not locking them out (the way we would for an all-A7 >> system). That is, we're fixing a bug in the check code. >> >> Obviously it seems worth investigating exactly what is >> happening on the A7s when we try to run the init code, >> but it seems to me like that's the orthogonal issue, >> because we shouldn't be running that code in the first >> place. > > Turns out that init code was designed to run unchanged on both A15 and > A7. If that code isn't properly working on a HYP-aware CPU, then we're > in trouble. > yes yes, we agree here, but it's a separate issue. > I don't give a damn about full BL support in KVM at that stage. All I > want is a clear description of *why* that code is failing, and possibly > a way to run guests on the A15 side of a BL platform, without having to > give away half of the cores. > > I think I've already spent too much time arguing about this. Hopefully > someone will be interested enough in BL to revert this patch and fix the > code (whether it is in the kernel or bootloader). > Currently we only support A15s, and that's explicit. You're arguing that we should pretend like we support other systems as well, should we then also just remove the check all together and let KVM run on a krait, on an A42 [1], and on foo, and hope for the best? Or should we be nice and tell the rest of the system, sorry this is not supported yet? I think the latter. Seriously, I don't think it sounds like a quick fix to make the code run only on A15s on BL, but if you do, please send the patch, and I'll be happy to review and merge that code. -Christoffer [1]: Made this up.
On 04/17/2013 11:12 AM, Christoffer Dall wrote: > > > > On Wed, Apr 17, 2013 at 1:16 AM, Marc Zyngier <marc.zyngier@arm.com > <mailto:marc.zyngier@arm.com>> wrote: > > On Wed, 17 Apr 2013 10:08:12 +0200, Andre Przywara > <andre.przywara@linaro.org <mailto:andre.przywara@linaro.org>> wrote: > > On 04/16/2013 06:33 PM, Marc Zyngier wrote: > >> On Tue, 16 Apr 2013 09:26:26 -0700, Christoffer Dall > >> <cdall@cs.columbia.edu <mailto:cdall@cs.columbia.edu>> wrote: > >>> On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon > <will.deacon@arm.com <mailto:will.deacon@arm.com>> > >> wrote: > >>>> On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: > >>>>> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: > >>>>>> I've run on this problem before, while trying to run KVM > guests on > >> A7 > >>>>>> cores. > >>>>>> > >>>>>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on the > >>>>>> instruction that updates HSCTLR between the two isbs on > >> __do_hyp_init > >>>>>> (mcr p15, 4, r0, c1, c0, 0). If you boot the system with > maxcpus=4 > >>>>>> then > >>>>>> init_hyp_mode() will not hang on the A7 cluster. Other than that > >> from > >>>>>> my > >>>>>> limited testing KVM on A7 works on a usual linux guest. I also > tried > >>>>>> to > >>>>>> only boot the 3rd A7 core to rule out any racing issues, but > still > >> the > >>>>>> same behaviour applies. > >>>>> > >>>>> Could well be the same issue here. I chased it down till CPU > 2 goes > >> into > >>>>> HYP mode to do the initialization. > >>>>> I am running with maxcpus=3 (this increases the likelyhood that > >>>>> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. > >>>>> As the HYP mode exception table is empty except for the HVC > trap, it > >> may > >>>>> be looping here. I am trying now to get the PC of the faulty > >>>>> instruction. > >>>> > >>>> Yes, it sounds like you're taking a recursive fault because the > vectors > >>>> aren't installed yet. Is there any chance you can find out > what value > >>>> you end > >>>> up writing (or trying to write) to the HSCTLR please? > >>>> > >>> Actually I'm a little confused, wasn't Andre seeing a halt on > an A15 > >>> cpu, not an A7? Or is the theory that an A7 locks up and the > calling > >>> A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the > A7 to > >>> complete? > >> > >> Yes, A15 hanging, not A7. That's why I'm strongly opposed to this > patch. > >> I'm pretty sure the A7s only have a side effect that triggers a > kernel > >> bug > >> on the A15 side. Before taking *any* patch around this, we should > >> understand the issue fully, and not start patching random stuff just > >> because Linus is going to tag 3.9. > > > > I think there is a misunderstanding. The RCU watchdog was complaining > > because the A15 wasn't making any progress. As Christoffer said, > this is > > > because it was waiting for CPU 2 to return from the SMP call. It is > > actually the A7 hanging inside HYP mode. > > I tried some ways to get information out of there, but had no luck so > > far. The different mapping between HYP and SVC doesn't make it > easy to > > dump some variables, but I am still working on it (but only half > steam > > You could force a full mapping of the kernel text in HYP. Ugly, but > should > work. > > > because I am home looking after my sick daughter). So for now I > assume > > that it is the HSCTLR setting Alexander observed already. > > I'll give it a go today or tomorrow, depending how quickly I can get rid > of my backlog after a couple of days off work. > > Assuming this is an A7 handing on HSCTLR access, it should be pretty > easy > to narrow down by booting only on the A7s, leaving the A15s held in > reset. > > > You could also try installing a vector handler early and detect faults, > and add an alternative return path from the init function with some > error reporting value in r0 or something like that, just for debugging, > naturally, but that could be a way to detect if we really are taking > recursive faults here. OK, I added code to return earlier on CPUs not from cluster 0. Indeed it hangs in the HSCR write. The two A15s pass this instruction, writing 0x30c5187F into the register. This means all the fixed bits for A15 correctly, C,A,M and I set and WXN, EE, TE cleared. FI was also cleared The A7 wanted to write the very same value. I tried to set bit 21, which kind of the A7 TRM hints to do: but no change. Before the HSCLTR write, the register reads 0x30c50878, with SCTLR being 0x30c5387d. So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR has the V bits set, could that be an issue? Regards, Andre.
On Fri, Apr 19, 2013 at 5:58 AM, Andre Przywara <andre.przywara@linaro.org> wrote: > On 04/17/2013 11:12 AM, Christoffer Dall wrote: >> >> >> >> >> On Wed, Apr 17, 2013 at 1:16 AM, Marc Zyngier <marc.zyngier@arm.com >> <mailto:marc.zyngier@arm.com>> wrote: >> >> On Wed, 17 Apr 2013 10:08:12 +0200, Andre Przywara >> <andre.przywara@linaro.org <mailto:andre.przywara@linaro.org>> wrote: >> > On 04/16/2013 06:33 PM, Marc Zyngier wrote: >> >> On Tue, 16 Apr 2013 09:26:26 -0700, Christoffer Dall >> >> <cdall@cs.columbia.edu <mailto:cdall@cs.columbia.edu>> wrote: >> >>> On Mon, Apr 15, 2013 at 6:48 AM, Will Deacon >> <will.deacon@arm.com <mailto:will.deacon@arm.com>> >> >> >> wrote: >> >>>> On Mon, Apr 15, 2013 at 02:13:55PM +0100, Andre Przywara wrote: >> >>>>> On 04/15/2013 11:52 AM, Alexander Spyridakis wrote: >> >>>>>> I've run on this problem before, while trying to run KVM >> guests on >> >> A7 >> >>>>>> cores. >> >>>>>> >> >>>>>> For some reason the 3rd A7 hangs in arch/arm/kvm/init.S, on >> the >> >>>>>> instruction that updates HSCTLR between the two isbs on >> >> __do_hyp_init >> >>>>>> (mcr p15, 4, r0, c1, c0, 0). If you boot the system with >> maxcpus=4 >> >>>>>> then >> >>>>>> init_hyp_mode() will not hang on the A7 cluster. Other than >> that >> >> from >> >>>>>> my >> >>>>>> limited testing KVM on A7 works on a usual linux guest. I also >> tried >> >>>>>> to >> >>>>>> only boot the 3rd A7 core to rule out any racing issues, but >> still >> >> the >> >>>>>> same behaviour applies. >> >>>>> >> >>>>> Could well be the same issue here. I chased it down till CPU >> 2 goes >> >> into >> >>>>> HYP mode to do the initialization. >> >>>>> I am running with maxcpus=3 (this increases the likelyhood that >> >>>>> kvm_target_cpu() runs on an A15), so CPU #2 is the only one A7. >> >>>>> As the HYP mode exception table is empty except for the HVC >> trap, it >> >> may >> >>>>> be looping here. I am trying now to get the PC of the faulty >> >>>>> instruction. >> >>>> >> >>>> Yes, it sounds like you're taking a recursive fault because the >> vectors >> >>>> aren't installed yet. Is there any chance you can find out >> what value >> >>>> you end >> >>>> up writing (or trying to write) to the HSCTLR please? >> >>>> >> >>> Actually I'm a little confused, wasn't Andre seeing a halt on >> an A15 >> >>> cpu, not an A7? Or is the theory that an A7 locks up and the >> calling >> >>> A15 hangs on the SMP call to cpu_init_hyp_mode, waiting for the >> A7 to >> >>> complete? >> >> >> >> Yes, A15 hanging, not A7. That's why I'm strongly opposed to this >> patch. >> >> I'm pretty sure the A7s only have a side effect that triggers a >> kernel >> >> bug >> >> on the A15 side. Before taking *any* patch around this, we should >> >> understand the issue fully, and not start patching random stuff >> just >> >> because Linus is going to tag 3.9. >> > >> > I think there is a misunderstanding. The RCU watchdog was >> complaining >> > because the A15 wasn't making any progress. As Christoffer said, >> this is >> >> > because it was waiting for CPU 2 to return from the SMP call. It is >> > actually the A7 hanging inside HYP mode. >> > I tried some ways to get information out of there, but had no luck >> so >> > far. The different mapping between HYP and SVC doesn't make it >> easy to >> > dump some variables, but I am still working on it (but only half >> steam >> >> You could force a full mapping of the kernel text in HYP. Ugly, but >> should >> work. >> >> > because I am home looking after my sick daughter). So for now I >> assume >> > that it is the HSCTLR setting Alexander observed already. >> >> I'll give it a go today or tomorrow, depending how quickly I can get >> rid >> of my backlog after a couple of days off work. >> >> Assuming this is an A7 handing on HSCTLR access, it should be pretty >> easy >> to narrow down by booting only on the A7s, leaving the A15s held in >> reset. >> >> >> You could also try installing a vector handler early and detect faults, >> and add an alternative return path from the init function with some >> error reporting value in r0 or something like that, just for debugging, >> naturally, but that could be a way to detect if we really are taking >> recursive faults here. > > > OK, I added code to return earlier on CPUs not from cluster 0. > Indeed it hangs in the HSCR write. The two A15s pass this instruction, > writing 0x30c5187F into the register. > This means all the fixed bits for A15 correctly, C,A,M and I set and WXN, > EE, TE cleared. FI was also cleared > The A7 wanted to write the very same value. I tried to set bit 21, which > kind of the A7 TRM hints to do: but no change. > Before the HSCLTR write, the register reads 0x30c50878, with SCTLR being > 0x30c5387d. > So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR has > the V bits set, could that be an issue? > Can you try writing 0x30c50879 into the register instead? Basically check to see if enabling caches or alignment checks causes the issue, or if it is indeed enabling the MMU that's the issue... If that works, start a bisect on the remaining bits. Also, just for fun, could you try flushing the entire I-cache before writing into the HSCLTR? Why wouldn't the V bit be set in the SCTLR, Linus uses high vectors (at 0xffff0000) for exception handling on ARM. -Christoffer
On 04/19/2013 06:13 PM, Christoffer Dall wrote: > On Fri, Apr 19, 2013 at 5:58 AM, Andre Przywara > <andre.przywara@linaro.org> wrote: >> On 04/17/2013 11:12 AM, Christoffer Dall wrote: >>> >>> ... >>> >>> You could also try installing a vector handler early and detect faults, >>> and add an alternative return path from the init function with some >>> error reporting value in r0 or something like that, just for debugging, >>> naturally, but that could be a way to detect if we really are taking >>> recursive faults here. >> >> >> OK, I added code to return earlier on CPUs not from cluster 0. >> Indeed it hangs in the HSCR write. The two A15s pass this instruction, >> writing 0x30c5187F into the register. >> This means all the fixed bits for A15 correctly, C,A,M and I set and WXN, >> EE, TE cleared. FI was also cleared >> The A7 wanted to write the very same value. I tried to set bit 21, which >> kind of the A7 TRM hints to do: but no change. >> Before the HSCLTR write, the register reads 0x30c50878, with SCTLR being >> 0x30c5387d. >> So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR has >> the V bits set, could that be an issue? >> > Can you try writing 0x30c50879 into the register instead? Basically > check to see if enabling caches or alignment checks causes the issue, > or if it is indeed enabling the MMU that's the issue... If that works, > start a bisect on the remaining bits. Also, just for fun, could you > try flushing the entire I-cache before writing into the HSCLTR? OK, both clearing the I-bit and doing an "isb; ICIALLU" before the "isb; write HSCLTR; isb" worked, the kernel boots on and KVM is enabled. I could easily make a patch, but I am not sure how to proceed from here: 1.) At least I don't have an understanding why this is only a problem on A7 and not on A15. I would feel better if we have an explanation for this. Mark, Will, Peter: any ideas? 2.) We still do not claim to support A7s. ARM_CPU_PART_CORTEX_A15 is the only MIDR content not returning -EINVAL. And there is no coproc_a7.c yet. From browsing through coproc_a15.c I sense it looks compatible to the A7, only handling L2CTLR[25:24] and ACTLR[SMP] explicitly. On pure A7 systems this should work, but in mixed (aka. b.L) systems guests would see inconsistent input depending on the current scheduling. This looks like we need a real solution, which needs some time and work (read: proper b.L support). 3.) Even if we would understand this fix and take it, the kernel currently still randomly enables or disables KVM - with only A15 supported and the check routine running on one of the cores only. And even if we add A7 to the list, I think we still need to iterate through all cores to catch unsupported cores in a different b.L configuration. I suppose it is to late for 3.9 in any case right? Or should we try to push at least the ICIALLU fix? > Why wouldn't the V bit be set in the SCTLR, Linus uses high vectors > (at 0xffff0000) for exception handling on ARM. Shame on me, didn't know this. I assumed that those high vectors are there for legacy reasons only. Good to know and thanks for telling me. Regards, Andre.
On 22/04/13 11:36, Andre Przywara wrote: > On 04/19/2013 06:13 PM, Christoffer Dall wrote: >> On Fri, Apr 19, 2013 at 5:58 AM, Andre Przywara >> <andre.przywara@linaro.org> wrote: >>> On 04/17/2013 11:12 AM, Christoffer Dall wrote: >>>> >>>> ... >>>> >>>> You could also try installing a vector handler early and detect faults, >>>> and add an alternative return path from the init function with some >>>> error reporting value in r0 or something like that, just for debugging, >>>> naturally, but that could be a way to detect if we really are taking >>>> recursive faults here. >>> >>> >>> OK, I added code to return earlier on CPUs not from cluster 0. >>> Indeed it hangs in the HSCR write. The two A15s pass this instruction, >>> writing 0x30c5187F into the register. >>> This means all the fixed bits for A15 correctly, C,A,M and I set and WXN, >>> EE, TE cleared. FI was also cleared >>> The A7 wanted to write the very same value. I tried to set bit 21, which >>> kind of the A7 TRM hints to do: but no change. >>> Before the HSCLTR write, the register reads 0x30c50878, with SCTLR being >>> 0x30c5387d. >>> So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR has >>> the V bits set, could that be an issue? >>> >> Can you try writing 0x30c50879 into the register instead? Basically >> check to see if enabling caches or alignment checks causes the issue, >> or if it is indeed enabling the MMU that's the issue... If that works, >> start a bisect on the remaining bits. Also, just for fun, could you >> try flushing the entire I-cache before writing into the HSCLTR? > > OK, both clearing the I-bit and doing an "isb; ICIALLU" before the "isb; > write HSCLTR; isb" worked, the kernel boots on and KVM is enabled. > > I could easily make a patch, but I am not sure how to proceed from here: > > 1.) At least I don't have an understanding why this is only a problem on > A7 and not on A15. I would feel better if we have an explanation for > this. Mark, Will, Peter: any ideas? Blink! Can you check your board.txt config file? It should have a line that starts with "SCC: 0x400 ..." If not, can you add a line that reads: SCC: 0x400 0x00000000 Thanks, M.
On Mon, 22 Apr 2013 12:02:59 +0100, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 22/04/13 11:36, Andre Przywara wrote: >> On 04/19/2013 06:13 PM, Christoffer Dall wrote: >>> On Fri, Apr 19, 2013 at 5:58 AM, Andre Przywara >>> <andre.przywara@linaro.org> wrote: >>>> On 04/17/2013 11:12 AM, Christoffer Dall wrote: >>>>> >>>>> ... >>>>> >>>>> You could also try installing a vector handler early and detect >>>>> faults, >>>>> and add an alternative return path from the init function with some >>>>> error reporting value in r0 or something like that, just for >>>>> debugging, >>>>> naturally, but that could be a way to detect if we really are taking >>>>> recursive faults here. >>>> >>>> >>>> OK, I added code to return earlier on CPUs not from cluster 0. >>>> Indeed it hangs in the HSCR write. The two A15s pass this instruction, >>>> writing 0x30c5187F into the register. >>>> This means all the fixed bits for A15 correctly, C,A,M and I set and >>>> WXN, >>>> EE, TE cleared. FI was also cleared >>>> The A7 wanted to write the very same value. I tried to set bit 21, >>>> which >>>> kind of the A7 TRM hints to do: but no change. >>>> Before the HSCLTR write, the register reads 0x30c50878, with SCTLR >>>> being >>>> 0x30c5387d. >>>> So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR >>>> has >>>> the V bits set, could that be an issue? >>>> >>> Can you try writing 0x30c50879 into the register instead? Basically >>> check to see if enabling caches or alignment checks causes the issue, >>> or if it is indeed enabling the MMU that's the issue... If that works, >>> start a bisect on the remaining bits. Also, just for fun, could you >>> try flushing the entire I-cache before writing into the HSCLTR? >> >> OK, both clearing the I-bit and doing an "isb; ICIALLU" before the "isb; >> write HSCLTR; isb" worked, the kernel boots on and KVM is enabled. >> >> I could easily make a patch, but I am not sure how to proceed from here: >> >> 1.) At least I don't have an understanding why this is only a problem on >> A7 and not on A15. I would feel better if we have an explanation for >> this. Mark, Will, Peter: any ideas? > > Blink! > > Can you check your board.txt config file? It should have a line that > starts with "SCC: 0x400 ..." > > If not, can you add a line that reads: > SCC: 0x400 0x00000000 Actually you, may want to try: SCC: 0x400 0x3330c00 instead, as Peter pointed out that some bits are flagged as reserved in the TRM... M.
On 04/22/2013 01:14 PM, Marc Zyngier wrote: > On Mon, 22 Apr 2013 12:02:59 +0100, Marc Zyngier <marc.zyngier@arm.com> > wrote: >> On 22/04/13 11:36, Andre Przywara wrote: >>> On 04/19/2013 06:13 PM, Christoffer Dall wrote: >>>> On Fri, Apr 19, 2013 at 5:58 AM, Andre Przywara >>>> <andre.przywara@linaro.org> wrote: >>>>> On 04/17/2013 11:12 AM, Christoffer Dall wrote: >>>>>> >>>>>> ... >>>>>> >>>>>> You could also try installing a vector handler early and detect >>>>>> faults, >>>>>> and add an alternative return path from the init function with some >>>>>> error reporting value in r0 or something like that, just for >>>>>> debugging, >>>>>> naturally, but that could be a way to detect if we really are taking >>>>>> recursive faults here. >>>>> >>>>> >>>>> OK, I added code to return earlier on CPUs not from cluster 0. >>>>> Indeed it hangs in the HSCR write. The two A15s pass this > instruction, >>>>> writing 0x30c5187F into the register. >>>>> This means all the fixed bits for A15 correctly, C,A,M and I set and >>>>> WXN, >>>>> EE, TE cleared. FI was also cleared >>>>> The A7 wanted to write the very same value. I tried to set bit 21, >>>>> which >>>>> kind of the A7 TRM hints to do: but no change. >>>>> Before the HSCLTR write, the register reads 0x30c50878, with SCTLR >>>>> being >>>>> 0x30c5387d. >>>>> So the code wants to set M, A, C and I in HSCLTR. Interestingly SCTLR >>>>> has >>>>> the V bits set, could that be an issue? >>>>> >>>> Can you try writing 0x30c50879 into the register instead? Basically >>>> check to see if enabling caches or alignment checks causes the issue, >>>> or if it is indeed enabling the MMU that's the issue... If that works, >>>> start a bisect on the remaining bits. Also, just for fun, could you >>>> try flushing the entire I-cache before writing into the HSCLTR? >>> >>> OK, both clearing the I-bit and doing an "isb; ICIALLU" before the > "isb; >>> write HSCLTR; isb" worked, the kernel boots on and KVM is enabled. >>> Arrgh! While cleaning up the code I saw that I had left a debug exit point early in init.S, so the HSCTLR write was not executed. Fixing this I can now say that cleaning the I-cache and even disabling it does not help. Sorry for the premature hooray! In fact the only thing that makes the kernel boot is cleaning the M bit (disabling the MMU), which is a debug hint, but by no means a useful workaround :-( So the problem still persists. >>> I could easily make a patch, but I am not sure how to proceed from > here: >>> >>> 1.) At least I don't have an understanding why this is only a problem > on >>> A7 and not on A15. I would feel better if we have an explanation for >>> this. Mark, Will, Peter: any ideas? >> >> Blink! >> >> Can you check your board.txt config file? It should have a line that >> starts with "SCC: 0x400 ..." >> >> If not, can you add a line that reads: >> SCC: 0x400 0x00000000 > > Actually you, may want to try: > SCC: 0x400 0x3330c00 > > instead, as Peter pointed out that some bits are flagged as reserved in > the TRM... I tried to play with SCC 0x400 for two boots, but it didn't change anything. Didn't do the full scale of possible HSCLTR bits with and without this, though. Does that make sense to do this, or was SCC 0x400 only for cache problems? Regards, Andre.
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 5a93698..9b53cdd 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -1128,20 +1128,35 @@ out_err: return err; } +static void check_kvm_target_cpu(void *ret) +{ + *(int*)ret = kvm_target_cpu(); +} + /** * Initialize Hyp-mode and memory mappings on all CPUs. */ int kvm_arch_init(void *opaque) { int err; + int count = 0, ret, cpu; if (!is_hyp_mode_available()) { kvm_err("HYP mode not available\n"); return -ENODEV; } - if (kvm_target_cpu() < 0) { - kvm_err("Target CPU not supported!\n"); + for_each_online_cpu(cpu) { + smp_call_function_single(cpu, check_kvm_target_cpu, &ret, 1); + if (ret >= 0) { + count++; + continue; + } + if (count) + kvm_err("big.LITTLE not supported, limit to A15 CPUs with maxcpus= to use KVM\n"); + else + kvm_err("Target CPU not supported!\n"); + return -ENODEV; }
kvm_target_cpus() checks the compatibility of the used CPU with KVM, which is currently limited to ARM Cortex-A15 cores. However by calling it only once on any random CPU it assumes that all cores are the same, which is not true for big.LITTLE parts. After doing about 40 boots on a TC-2 core tile, I found it running in all but one case on one of the A7 cores (which correctly denied KVM initialization). On the 39th boot however the code ran on an A15, leading to a hang after returning success: ... TCP: reno registered UDP hash table entries: 512 (order: 2, 16384 bytes) UDP-Lite hash table entries: 512 (order: 2, 16384 bytes) NET: Registered protocol family 1 kvm_target_cpu() on CPU #1, part is c0f0 ... (pause for a while) ... INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected stalls on CPU s/tasks: { 1} (detected by 0, t=6002 jiffies, g=4294966999, c=4294966998, q=15) Task dump for CPU 1: swapper/0 R running 0 1 0 0x00000002 So iterate over every CPU to correctly determine the capability of the system to run the current KVM implementation. In case a big.LITTLE configuration is the reason for denial, give the user a hint how to get it running anyway (maxcpus= on the kernel command line). Please push this still into 3.9. Signed-off-by: Andre Przywara <andre.przywara@linaro.org> --- arch/arm/kvm/arm.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)