Message ID | 20140102230832.GB9339@obsidianresearch.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> .. and since the problem only shows obviously with multiplatform > kernels that have been relocated it could be a broader, subtler issue > than just your one board ... It is not just this board. It is any board using an old mainline u-boot. Any such system will not boot with a kernel after 3.1. It was at that point that the issue was found and fixed in u-boot. My Bubba3 has a u-boot before that point... > Jason C: What do you think about applying this [untested]? I'm away for the weekend, so cannot test it. > Or do you think the kernel should support booting with L2 on (this is > just two instructions, but they need to be conditional on the CPU type > :( :( )? It would be good to at least put the fix into the impedance matcher, so there is no need to mess with the kernel. Andrew
On Thu, Jan 02, 2014 at 04:08:32PM -0700, Jason Gunthorpe wrote: > On Thu, Jan 02, 2014 at 10:36:32PM +0000, Ian Campbell wrote: > > On Thu, 2014-01-02 at 12:49 -0700, Jason Gunthorpe wrote: > > > On Sat, Dec 28, 2013 at 12:01:14PM -0500, Jason Cooper wrote: > > > > > > > > + * Note: This requires a new'ish version of u-boot, which disables the > > > > > + * L2 cache. If your B3 silently fails to boot, u-boot is probably too > > > > > + * old. Either upgrade, or consider the following email: > > > > > + * > > > > > + * http://lists.debian.org/debian-arm/2012/08/msg00128.html > > > > > > > > Nice, thanks for adding this. > > > > > > Nifty.. But what is the root cause for the above? > > > > > > I'm guessing that at some point the the L1 icache has been enabled, > > > the L1 dcache disabled, and the L2 cache enabled? > > > > > > If the pv_fixup runs with both L1 caches off and an empty L2 then it > > > shouldn't cause a problem. > > > > I'm not 100% sure but I think the issue is that the kernel expects to be > > running with all of the caches off, but u-boot incorrectly left the l2 > > on. The L2 is probably not empty (because u-boot was using it). > > I think I see the problem now.. The decompressor turns on the L1 > dcache, which populates the enabled L2 with write through data. When > the decompressor turns the dcache off again the L2 retains some of the > kernel image and any data writes done in head.S while the dcache is > off could be corrupted if they overlap the L2's retention, > specifically the pv_fixup stuff, but other elements could potentially > be mangled too. > > So all Marvell platforms with their L2 cache controller must boot into > the decompressor with the L2 cache completely off, or the kernel needs > additional Marvell L2 specific flushing code prior to enabling the MMU > :( > > .. and since the problem only shows obviously with multiplatform > kernels that have been relocated it could be a broader, subtler issue > than just your one board ... > > Jason C: What do you think about applying this [untested]? I'm of the opinion that the kernel shouldn't assume the state of the system when it begins execution. Sure it's nice to point fingers at the bootloader (especially since it's obviously wrong in this situation), but we need Linux to boot on systems with non-upgradeable bootloaders which may be broken. I'm specifically thinking about hobbyists modifying boxes they bought at $bigbox stores. So, I'd prefer to handle this more gracefully. I don't have much experience at the low-level init of the caches, couldn't we enable and flush rather than throwing the error? thx, Jason (C). > Or do you think the kernel should support booting with L2 on (this is > just two instructions, but they need to be conditional on the CPU type > :( :( )? > > ----- > arm: Kirkwood: Log a FW_BUG if the L2 cache is turned on at boot > > Linux can't cope if the the L2 cache is enabled at boot. It works > sometimes, if the data writes head.S does during the L1 cache-off > period do not overlap with lines in the L2. > > This is unlikely if CONFIG_ARM_PATCH_PHYS_VIRT is used, as the > decompressor will pull a chunk of the kernel code into L2 cache. > > Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> > > diff --git a/arch/arm/mm/cache-feroceon-l2.c b/arch/arm/mm/cache-feroceon-l2.c > index 48bc3c0..68781a4 100644 > --- a/arch/arm/mm/cache-feroceon-l2.c > +++ b/arch/arm/mm/cache-feroceon-l2.c > @@ -331,7 +331,8 @@ static void __init enable_l2(void) > enable_icache(); > if (d) > enable_dcache(); > - } > + } else > + pr_err(FW_BUG "Feroceon L2: bootloader left the L2 cache on!\n"); > } > > void __init feroceon_l2_init(int __l2_wt_override) > > > Jason
On Fri, Jan 10, 2014 at 02:20:32PM -0500, Jason Cooper wrote: > > Jason C: What do you think about applying this [untested]? > > I'm of the opinion that the kernel shouldn't assume the state of the > system when it begins execution. Sure it's nice to point fingers at the > bootloader (especially since it's obviously wrong in this situation), > but we need Linux to boot on systems with non-upgradeable bootloaders > which may be broken. I'm specifically thinking about hobbyists > modifying boxes they bought at $bigbox stores. I liked the idea of your impedance matcher fixing up bootloader problems. It would be pretty easy to have a feroceon specific path in the impedance matcher to disable the L2. > So, I'd prefer to handle this more gracefully. I don't have much > experience at the low-level init of the caches, couldn't we enable and > flush rather than throwing the error? It cannot be solved in cache-feroceon-l2.c. In many cases you won't even get that far: - It will blow up in head.S when the cache is off: decompressor wrote writeback data to into the L2 and uncached fetches see memory content prior to decompression - It will blow up after head.S enables the L1: decompressor wrote data into the L2 but the relocation writes done with the L1 off made changes to memory that were not captured in the L2. So to solve it gracefully it has to be done before the decompressor enables the L1. Which leaves the bootloader, impedance matcher or decompressor startup code as possible places. The code required is a flush invalidate and then disable of the L2 cache, which is only about 5 instructions, but they must be conditionalized somehow on the CPU type :( The purpose of my patch is to highlight the problem to the bootloader people. If you luck out and successfully boot with the L2 cache enabled you can at least see the message that the bootloader is wrong. Jason
On Fri, Jan 10, 2014 at 12:44:37PM -0700, Jason Gunthorpe wrote: > On Fri, Jan 10, 2014 at 02:20:32PM -0500, Jason Cooper wrote: > > > Jason C: What do you think about applying this [untested]? > > > > I'm of the opinion that the kernel shouldn't assume the state of the > > system when it begins execution. Sure it's nice to point fingers at the > > bootloader (especially since it's obviously wrong in this situation), > > but we need Linux to boot on systems with non-upgradeable bootloaders > > which may be broken. I'm specifically thinking about hobbyists > > modifying boxes they bought at $bigbox stores. > > I liked the idea of your impedance matcher fixing up bootloader > problems. It would be pretty easy to have a feroceon specific path in > the impedance matcher to disable the L2. Agreed. > > So, I'd prefer to handle this more gracefully. I don't have much > > experience at the low-level init of the caches, couldn't we enable and > > flush rather than throwing the error? > > It cannot be solved in cache-feroceon-l2.c. > > In many cases you won't even get that far: > - It will blow up in head.S when the cache is off: decompressor wrote > writeback data to into the L2 and uncached fetches see memory > content prior to decompression > - It will blow up after head.S enables the L1: decompressor wrote > data into the L2 but the relocation writes done with the L1 off > made changes to memory that were not captured in the L2. > > So to solve it gracefully it has to be done before the decompressor > enables the L1. Which leaves the bootloader, impedance matcher or > decompressor startup code as possible places. > > The code required is a flush invalidate and then disable of the L2 > cache, which is only about 5 instructions, but they must be > conditionalized somehow on the CPU type :( And there's the problem. Even at my young LKML age, I know that's a non-starter. :) > The purpose of my patch is to highlight the problem to the bootloader > people. If you luck out and successfully boot with the L2 cache > enabled you can at least see the message that the bootloader is wrong. Sounds sane to me, feel free to add my Acked-by and put it in Russell's patch tracker. I'd also ask that you add the above summary to the commit message. I'm sure Russell doesn't need it, but it'll help someone doing a 'git blame ...' down the road. thx, Jason.
> > So, I'd prefer to handle this more gracefully. I don't have much > > experience at the low-level init of the caches, couldn't we enable and > > flush rather than throwing the error? > > It cannot be solved in cache-feroceon-l2.c. > > In many cases you won't even get that far: > - It will blow up in head.S when the cache is off: decompressor wrote > writeback data to into the L2 and uncached fetches see memory > content prior to decompression > - It will blow up after head.S enables the L1: decompressor wrote > data into the L2 but the relocation writes done with the L1 off > made changes to memory that were not captured in the L2. My impression with booting a lot of times while doing i bisect, was that it got as far as: Uncompressing Linux... done, booting the kernel. Booting Linux on physical CPU 0x0 Linux version 3.13.0-rc1-dirty (lunn@laptop) (gcc version 4.4.5 (Debian 4.4.5-8) ) #7 PREEMPT Fri Jan 3 09:25:07 CST 2014 CPU: Feroceon 88FR131 [56251311] revision 1 (ARMv5TE), cr=00053977 CPU: VIVT data cache, VIVT instruction cache Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family Memory policy: ECC disabled, Data cache writeback every time. It frequently got as far as Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048 Kernel command line: root=/dev/sda2 console=ttyS0,115200 earlyprintk PID hash table entries: 2048 (order: 1, 8192 bytes) Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) Memory: 512704K/524288K available (4693K kernel code, 247K rwdata, 1264K rodata, 154K init, 616K bss, 11584K reserved) Virtual kernel memory layout: vector : 0xffff0000 - 0xffff1000 ( 4 kB) fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB) vmalloc : 0xe0800000 - 0xff000000 ( 488 MB) lowmem : 0xc0000000 - 0xe0000000 ( 512 MB) modules : 0xbf000000 - 0xc0000000 ( 16 MB) .text : 0xc0008000 - 0xc05d9678 (5958 kB) .init : 0xc05da000 - 0xc0600ab0 ( 155 kB) .data : 0xc0602000 - 0xc063fcc0 ( 248 kB) .bss : 0xc063fccc - 0xc06d9ef0 ( 617 kB) SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 Preemptible hierarchical RCU implementation. NR_IRQS:114 sched_clock: 32 bits at 200MHz, resolution 5ns, wraps every 21474ms And never made it past Console: colour dummy device 80x30 Calibrating delay loop... 1587.60 BogoMIPS (lpj=7938048) with L2 enabled. Now clearly there is a difference between how far it got before everything stops, and the point where corruption has occurred and unavoidable death is going to happen sometime in the future. It does however seem possible to insert platform specific code soon after it prints: Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family but i never got around to trying to disable and flush L2 at that point. And the hardware is now also on the other side of the pond, and i have no way to remotely power cycle it. Andrew
On Sat, Jan 11, 2014 at 12:02:48AM +0100, Andrew Lunn wrote: > > > So, I'd prefer to handle this more gracefully. I don't have much > > > experience at the low-level init of the caches, couldn't we enable and > > > flush rather than throwing the error? > > > > It cannot be solved in cache-feroceon-l2.c. > > > > In many cases you won't even get that far: > > - It will blow up in head.S when the cache is off: decompressor wrote > > writeback data to into the L2 and uncached fetches see memory > > content prior to decompression > > - It will blow up after head.S enables the L1: decompressor wrote > > data into the L2 but the relocation writes done with the L1 off > > made changes to memory that were not captured in the L2. > > My impression with booting a lot of times while doing i bisect, was > that it got as far as: > > Uncompressing Linux... done, booting the kernel. > Booting Linux on physical CPU 0x0 > Linux version 3.13.0-rc1-dirty (lunn@laptop) (gcc version 4.4.5 (Debian 4.4.5-8) ) #7 PREEMPT Fri Jan 3 09:25:07 CST 2014 > CPU: Feroceon 88FR131 [56251311] revision 1 (ARMv5TE), cr=00053977 > CPU: VIVT data cache, VIVT instruction cache > Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family > Memory policy: ECC disabled, Data cache writeback > > every time. It should be deterministic, but 'random' in the sense it is sensitive to exactly the data pattern generated by the decompressor and exactly the data pattern generated by the relocation writes - so the corruption will vary depending on the kernel image content (position of the relocations) and also the size of the L2. You are likely seeing the 2nd failure I predicted, which is a failure when a cache line containing a relocation is used, and happened to be in the L2. The first failure would require the kernel image to be smaller than the cache or the cache eviction policy to be such that the initial part of the decompressed image remains dirty and unevicted in the L2. > It does however seem possible to insert platform specific code soon > after it prints: > > Machine: Marvell Kirkwood (Flattened Device Tree), model: QNAP TS219 family So if you attempt to do the flush/disable at this point you are relying on the L2 containing no dirty lines that overlap with lines that were changed by the relocator while the cache was off. This is *probably* true in many cases, but it is sktechy, and creates a hidden hard to find pitfall. This is basically the same 'probably' that enables non-relocated kernels to boot. Basically, you might get a bootable system, but it is technically wrong and has a theoretical hole. Jason
On Fri, Jan 10, 2014 at 02:20:32PM -0500, Jason Cooper wrote: > I'm of the opinion that the kernel shouldn't assume the state of the > system when it begins execution. Rubbish. The kernel's requirements are well documented, and have been for years. The reason we've documented it is because it _matters_, and it's hard to sort out this crap before the kernel starts. Think about it - think about how many different L2 caches there are out there. Think about how difficult it is to locate the L2 cache controller when you don't know what platform you're running on. Think about how difficult it is to disable the L2 cache. Think about how difficult it is to flush the L2 cache, when you don't know what sort it is, where it's located, or what instructions to use to do that. The kernel cares about its boot environment beacuse of these issues.
On Sat, Jan 11, 2014 at 03:59:00PM +0000, Russell King - ARM Linux wrote: > On Fri, Jan 10, 2014 at 02:20:32PM -0500, Jason Cooper wrote: > > I'm of the opinion that the kernel shouldn't assume the state of the > > system when it begins execution. > > Rubbish. The kernel's requirements are well documented, and have been for > years. The reason we've documented it is because it _matters_, and it's > hard to sort out this crap before the kernel starts. > > Think about it - think about how many different L2 caches there are out > there. > > Think about how difficult it is to locate the L2 cache controller when > you don't know what platform you're running on. > > Think about how difficult it is to disable the L2 cache. > > Think about how difficult it is to flush the L2 cache, when you don't > know what sort it is, where it's located, or what instructions to use > to do that. > > The kernel cares about its boot environment beacuse of these issues. Agreed. My failure was applying the thought process for drivers (eg mv643xx_eth) to the core kernel code. With a network driver we are often in a situation where sometimes the clocks are gated, sometimes they aren't. For example, doing tftp boot during development, and nandboot in production. In that case, sometimes the bootloader sets things up and sometimes it doesn't depending on the bootloader environment. The mv643xx_eth driver can't assume the state that's handed to it. But as you and JasonG have clearly explained, that doesn't carry weight so early in the boot process for all the reasons you mentioned. thx, Jason.
diff --git a/arch/arm/mm/cache-feroceon-l2.c b/arch/arm/mm/cache-feroceon-l2.c index 48bc3c0..68781a4 100644 --- a/arch/arm/mm/cache-feroceon-l2.c +++ b/arch/arm/mm/cache-feroceon-l2.c @@ -331,7 +331,8 @@ static void __init enable_l2(void) enable_icache(); if (d) enable_dcache(); - } + } else + pr_err(FW_BUG "Feroceon L2: bootloader left the L2 cache on!\n"); } void __init feroceon_l2_init(int __l2_wt_override)