Message ID | 20120930232116.GC30637@obsidianresearch.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Sep 30, 2012 at 05:21:16PM -0600, Jason Gunthorpe wrote: > The standard linux asm-generic/vmlinux.lds.h already supports this, > and it seems other architectures do as well. > > The goal is to create an ELF file that has correct program headers. We > want to see the VirtAddr be the runtime address of the kernel with the > MMU turned on, and PhysAddr be the physical load address for the section > with no MMU. > > This allows ELF based boot loaders to properly load vmlinux: > > $ readelf -l vmlinux > Entry point 0x8000 > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align > LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 Not related directly to your patch, but I wonder why we don't we see separate r-x and rw- segments? Perhaps the linker script needs tidyup, or there are wrong attributes on some sections somewhere. > Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> > --- > arch/arm/include/asm/memory.h | 2 +- > arch/arm/kernel/vmlinux.lds.S | 47 ++++++++++++++++++++++++---------------- > 2 files changed, 29 insertions(+), 20 deletions(-) > > diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h > index 5f6ddcc..4ce5b6d 100644 > --- a/arch/arm/include/asm/memory.h > +++ b/arch/arm/include/asm/memory.h > @@ -283,7 +283,7 @@ static inline __deprecated void *bus_to_virt(unsigned long x) > #define arch_is_coherent() 0 > #endif > > -#endif > +#endif /* __ASSEMBLY__ */ > > #include <asm-generic/memory_model.h> > > diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S > index 36ff15b..07942b6 100644 > --- a/arch/arm/kernel/vmlinux.lds.S > +++ b/arch/arm/kernel/vmlinux.lds.S > @@ -3,6 +3,13 @@ > * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz> > */ > > +/* If we have a known, fixed physical load address then set LOAD_OFFSET > + and generate an ELF that has the physical load address in the program > + headers. */ > +#ifndef CONFIG_ARM_PATCH_PHYS_VIRT > +#define LOAD_OFFSET (PAGE_OFFSET - PHYS_OFFSET) > +#endif > + What happens if CONFIG_ARM_PATCH_PHYS_VIRT=y? This will be used increasingly, especially for multiplatform kernels. Currently, it looks like we will just fail to link the kernel with ARM_PATCH_PHYS_VIRT... or have I overlooked something? If the kernel is intended to be loadable at a physical address which is not statically known, no ELF loader that does not ignore the ELF phdr address fields can work. There is no correct way to represent this situation using an ET_EXEC image, unless we specify explicitly that the addresses must be ignored as part of our boot protocol. (We ran into the same situation with uImages, which bake the load address into the image. The eventual answer was to fix U-Boot.) Really, LOAD_OFFSET (in your terminology) is something that has to be inferred by the loader in the general case, and in that case the ELF paddr might as well be the same as the vaddr. > #include <asm-generic/vmlinux.lds.h> > #include <asm/cache.h> > #include <asm/thread_info.h> > @@ -39,7 +46,7 @@ > #endif > > OUTPUT_ARCH(arm) > -ENTRY(stext) > +ENTRY(phys_start) This is debatable. In fact, stext has the property that its virtual (runtime) and load addresses are the same. To represent this correctly in the linker scripts, the position-independent head.S code should be split out into a separate section to which LOAD_OFFSET is not applied. This may cause confusion elsewhere though, since _text probably needs to coincide with the virtual image of stext. Setting vaddr and paddr to PAGE_OFFSET (as we do now) and having the loader choose the appropriate board-specific place to load the kernel image makes this irrelevant, if I've understood the situation correctly. Cheers ---Dave > > #ifndef __ARMEB__ > jiffies = jiffies_64; > @@ -86,11 +93,13 @@ SECTIONS > #else > . = PAGE_OFFSET + TEXT_OFFSET; > #endif > - .head.text : { > + .head.text : AT(ADDR(.head.text) - LOAD_OFFSET) { > _text = .; > + phys_start = . - LOAD_OFFSET; > HEAD_TEXT > } > - .text : { /* Real text segment */ > + /* Real text segment */ > + .text : AT(ADDR(.text) - LOAD_OFFSET) { > _stext = .; /* Text and read-only data */ > __exception_text_start = .; > *(.exception.text) > @@ -119,12 +128,12 @@ SECTIONS > * Stack unwinding tables > */ > . = ALIGN(8); > - .ARM.unwind_idx : { > + .ARM.unwind_idx : AT(ADDR(.ARM.unwind_idx) - LOAD_OFFSET) { > __start_unwind_idx = .; > *(.ARM.exidx*) > __stop_unwind_idx = .; > } > - .ARM.unwind_tab : { > + .ARM.unwind_tab : AT(ADDR(.ARM.unwind_tab) - LOAD_OFFSET) { > __start_unwind_tab = .; > *(.ARM.extab*) > __stop_unwind_tab = .; > @@ -139,35 +148,35 @@ SECTIONS > #endif > > INIT_TEXT_SECTION(8) > - .exit.text : { > + .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { > ARM_EXIT_KEEP(EXIT_TEXT) > } > - .init.proc.info : { > + .init.proc.info : AT(ADDR(.init.proc.info) - LOAD_OFFSET) { > ARM_CPU_DISCARD(PROC_INFO) > } > - .init.arch.info : { > + .init.arch.info : AT(ADDR(.init.arch.info) - LOAD_OFFSET) { > __arch_info_begin = .; > *(.arch.info.init) > __arch_info_end = .; > } > - .init.tagtable : { > + .init.tagtable : AT(ADDR(.init.tagtable) - LOAD_OFFSET) { > __tagtable_begin = .; > *(.taglist.init) > __tagtable_end = .; > } > #ifdef CONFIG_SMP_ON_UP > - .init.smpalt : { > + .init.smpalt : AT(ADDR(.init.smpalt) - LOAD_OFFSET) { > __smpalt_begin = .; > *(.alt.smp.init) > __smpalt_end = .; > } > #endif > - .init.pv_table : { > + .init.pv_table : AT(ADDR(.init.pv_table) - LOAD_OFFSET) { > __pv_table_begin = .; > *(.pv_table) > __pv_table_end = .; > } > - .init.data : { > + .init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { > #ifndef CONFIG_XIP_KERNEL > INIT_DATA > #endif > @@ -178,7 +187,7 @@ SECTIONS > INIT_RAM_FS > } > #ifndef CONFIG_XIP_KERNEL > - .exit.data : { > + .exit.data : AT(ADDR(.exit.data) - LOAD_OFFSET) { > ARM_EXIT_KEEP(EXIT_DATA) > } > #endif > @@ -196,7 +205,7 @@ SECTIONS > __data_loc = .; > #endif > > - .data : AT(__data_loc) { > + .data : AT(__data_loc - LOAD_OFFSET) { > _data = .; /* address in memory */ > _sdata = .; > > @@ -245,7 +254,7 @@ SECTIONS > * free it after init has commenced and TCM contents have > * been copied to its destination. > */ > - .tcm_start : { > + .tcm_start : AT(ADDR(.tcm_start) - LOAD_OFFSET) { > . = ALIGN(PAGE_SIZE); > __tcm_start = .; > __itcm_start = .; > @@ -257,7 +266,7 @@ SECTIONS > * and we'll upload the contents from RAM to TCM and free > * the used RAM after that. > */ > - .text_itcm ITCM_OFFSET : AT(__itcm_start) > + .text_itcm ITCM_OFFSET : AT(__itcm_start - LOAD_OFFSET) > { > __sitcm_text = .; > *(.tcm.text) > @@ -272,12 +281,12 @@ SECTIONS > */ > . = ADDR(.tcm_start) + SIZEOF(.tcm_start) + SIZEOF(.text_itcm); > > - .dtcm_start : { > + .dtcm_start : AT(ADDR(.dtcm_start) - LOAD_OFFSET) { > __dtcm_start = .; > } > > /* TODO: add remainder of ITCM as well, that can be used for data! */ > - .data_dtcm DTCM_OFFSET : AT(__dtcm_start) > + .data_dtcm DTCM_OFFSET : AT(__dtcm_start - LOAD_OFFSET) > { > . = ALIGN(4); > __sdtcm_data = .; > @@ -290,7 +299,7 @@ SECTIONS > . = ADDR(.dtcm_start) + SIZEOF(.data_dtcm); > > /* End marker for freeing TCM copy in linked object */ > - .tcm_end : AT(ADDR(.dtcm_start) + SIZEOF(.data_dtcm)){ > + .tcm_end : AT(ADDR(.dtcm_start) + SIZEOF(.data_dtcm) - LOAD_OFFSET){ > . = ALIGN(PAGE_SIZE); > __tcm_end = .; > } > -- > 1.7.4.1 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, Oct 01, 2012 at 04:39:53PM +0100, Dave Martin wrote: > > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align > > LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 > > Not related directly to your patch, but I wonder why we don't we see > separate r-x and rw- segments? I think this is because the sections are not aligned when the protections change, and the sections are not sorted by protection type. [ 1] .head.text PROGBITS c0008000 008000 0001c0 00 AX 0 0 32 [ 2] .text PROGBITS c00081c0 0081c0 1dc880 00 AX 0 0 32 [ 3] .rodata PROGBITS c01e5000 1e5000 060a80 00 A 0 0 32 [ 4] __bug_table PROGBITS c0245a80 245a80 002a78 00 A 0 0 1 [ 5] .pci_fixup PROGBITS c02484f8 2484f8 000090 00 A 0 0 4 [ 6] __param PROGBITS c0248588 248588 000280 00 A 0 0 4 [ 7] __modver PROGBITS c0248808 248808 0007f8 00 A 0 0 4 [ 8] .init.text PROGBITS c0249000 249000 01af20 00 AX 0 0 32 [ 9] .exit.text PROGBITS c0263f20 263f20 000c0c 00 AX 0 0 4 [10] .init.proc.info PROGBITS c0264b2c 264b2c 00009c 00 AX 0 0 1 [11] .init.arch.info PROGBITS c0264bc8 264bc8 000044 00 A 0 0 4 [12] .init.tagtable PROGBITS c0264c0c 264c0c 000040 00 A 0 0 4 [13] .init.data PROGBITS c0264c60 264c60 0f481c 00 WA 0 0 32 [14] .data PROGBITS c035a000 35a000 020220 00 WA 0 0 32 [15] .notes NOTE c037a220 37a220 000024 00 AX 0 0 4 [16] .bss NOBITS c037a260 37a244 0320b0 00 WA 0 0 32 [17] .comment PROGBITS 00000000 37a244 000021 01 MS 0 0 1 > > +/* If we have a known, fixed physical load address then set LOAD_OFFSET > > + and generate an ELF that has the physical load address in the program > > + headers. */ > > +#ifndef CONFIG_ARM_PATCH_PHYS_VIRT > > +#define LOAD_OFFSET (PAGE_OFFSET - PHYS_OFFSET) > > +#endif > > + > > What happens if CONFIG_ARM_PATCH_PHYS_VIRT=y? This will be used > increasingly, especially for multiplatform kernels. Then LOAD_OFFSET is unset and include/asm-generic/vmlinux.lds.h does: #ifndef LOAD_OFFSET #define LOAD_OFFSET 0 #endif Which is exactly the same case we have today. LOAD_OFFSET is not a name I invented, it is the standard kernel name for the functionality. > If the kernel is intended to be loadable at a physical address which is > not statically known, no ELF loader that does not ignore the ELF > phdr In this case you can't really use a standard ELF loader to load the kernel so, LOAD_OFFSET = 0 is fine. My case is using an ELF loader, and I have set options for a static physical load address. > > OUTPUT_ARCH(arm) > > -ENTRY(stext) > > +ENTRY(phys_start) > > This is debatable. In fact, stext has the property that its virtual > (runtime) and load addresses are the same. This is what other arches using LOAD_OFFSET do (eg see ia64), and is sensible. ENTRY is *only* used by the ELF loader and specifies where the loader should jump. It must be a physical address since the loader only works with physical addresse. I don't understand your comment that .stext has the same load and virtual address, that is clearly not true for the ELF images my build is producing: c0008000 T _text c0008000 T stext The physical address of that symbol is 0x8000. > To represent this correctly in the linker scripts, the > position-independent head.S code should be split out into a separate > section to which LOAD_OFFSET is not applied. I'm not sure, the linker script should use addresses that are correct during runtime, and if the PIC code does not care about its PC then it is fine to keep it at the virtual address to minimize confusion once the MMU is on. > Setting vaddr and paddr to PAGE_OFFSET (as we do now) and having the > loader choose the appropriate board-specific place to load the kernel > image makes this irrelevant, if I've understood the situation correctly. Yes, if you use more loader stages then the load headers are ignored. Our boot loaders on our boards boot straight ELF vmlinux.gz so they need correct load headers. Jason
On Mon, Oct 01, 2012 at 10:06:39AM -0600, Jason Gunthorpe wrote: > On Mon, Oct 01, 2012 at 04:39:53PM +0100, Dave Martin wrote: > > > > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align > > > LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 > > > > Not related directly to your patch, but I wonder why we don't we see > > separate r-x and rw- segments? > > I think this is because the sections are not aligned when the > protections change, and the sections are not sorted by protection > type. > > [ 1] .head.text PROGBITS c0008000 008000 0001c0 00 AX 0 0 32 > [ 2] .text PROGBITS c00081c0 0081c0 1dc880 00 AX 0 0 32 > [ 3] .rodata PROGBITS c01e5000 1e5000 060a80 00 A 0 0 32 > [ 4] __bug_table PROGBITS c0245a80 245a80 002a78 00 A 0 0 1 > [ 5] .pci_fixup PROGBITS c02484f8 2484f8 000090 00 A 0 0 4 > [ 6] __param PROGBITS c0248588 248588 000280 00 A 0 0 4 > [ 7] __modver PROGBITS c0248808 248808 0007f8 00 A 0 0 4 > [ 8] .init.text PROGBITS c0249000 249000 01af20 00 AX 0 0 32 > [ 9] .exit.text PROGBITS c0263f20 263f20 000c0c 00 AX 0 0 4 > [10] .init.proc.info PROGBITS c0264b2c 264b2c 00009c 00 AX 0 0 1 > [11] .init.arch.info PROGBITS c0264bc8 264bc8 000044 00 A 0 0 4 > [12] .init.tagtable PROGBITS c0264c0c 264c0c 000040 00 A 0 0 4 > [13] .init.data PROGBITS c0264c60 264c60 0f481c 00 WA 0 0 32 > [14] .data PROGBITS c035a000 35a000 020220 00 WA 0 0 32 > [15] .notes NOTE c037a220 37a220 000024 00 AX 0 0 4 > [16] .bss NOBITS c037a260 37a244 0320b0 00 WA 0 0 32 > [17] .comment PROGBITS 00000000 37a244 000021 01 MS 0 0 1 > > > > > +/* If we have a known, fixed physical load address then set LOAD_OFFSET > > > + and generate an ELF that has the physical load address in the program > > > + headers. */ > > > +#ifndef CONFIG_ARM_PATCH_PHYS_VIRT > > > +#define LOAD_OFFSET (PAGE_OFFSET - PHYS_OFFSET) > > > +#endif > > > + > > > > What happens if CONFIG_ARM_PATCH_PHYS_VIRT=y? This will be used > > increasingly, especially for multiplatform kernels. > > Then LOAD_OFFSET is unset and include/asm-generic/vmlinux.lds.h does: > > #ifndef LOAD_OFFSET > #define LOAD_OFFSET 0 > #endif OK, good. > Which is exactly the same case we have today. LOAD_OFFSET is not a > name I invented, it is the standard kernel name for the functionality. > > > If the kernel is intended to be loadable at a physical address which is > > not statically known, no ELF loader that does not ignore the ELF > > phdr > > In this case you can't really use a standard ELF loader to load the > kernel so, LOAD_OFFSET = 0 is fine. My case is using an ELF loader, > and I have set options for a static physical load address. Generally, people should try to be compatible with the single kernel image effort unless there's a really compelling reason not to. Wouldn't your firmware be incapable of loading a multiplatform kernel? > > > OUTPUT_ARCH(arm) > > > -ENTRY(stext) > > > +ENTRY(phys_start) > > > > This is debatable. In fact, stext has the property that its virtual > > (runtime) and load addresses are the same. > > This is what other arches using LOAD_OFFSET do (eg see ia64), and is > sensible. ENTRY is *only* used by the ELF loader and specifies where > the loader should jump. It must be a physical address since the loader > only works with physical addresse. I don't understand your comment > that .stext has the same load and virtual address, that is clearly not > true for the ELF images my build is producing: > > c0008000 T _text > c0008000 T stext > > The physical address of that symbol is 0x8000. Well, that was a bit of a pedantic point I admit, but there are conflicting definitions of what "virtual address" really means in these situations. The original SYSV ABI spec explicitly specifies that e_entry is a virtual address, but is also rather vague about how the paddr fields should be interpreted. All that AT(ADDR(blah) - LOAD_OFFSET) stuff is cumbersome, but if it's at least consistent with other architectures then it may not such a disaster. It's not universal though: less than 50% of the arches in the kernel currently seem to use this. An alternative, cleaner approach might be to specify segment load addresses directly using PHDRS { }. This should at leat allow the load address to be specified once per phdr, rather than once per section. > > To represent this correctly in the linker scripts, the > > position-independent head.S code should be split out into a separate > > section to which LOAD_OFFSET is not applied. > > I'm not sure, the linker script should use addresses that are correct > during runtime, and if the PIC code does not care about its PC then it > is fine to keep it at the virtual address to minimize confusion once > the MMU is on. > > Setting vaddr and paddr to PAGE_OFFSET (as we do now) and having the > > loader choose the appropriate board-specific place to load the kernel > > image makes this irrelevant, if I've understood the situation correctly. > > Yes, if you use more loader stages then the load headers are ignored. > Our boot loaders on our boards boot straight ELF vmlinux.gz so they > need correct load headers. If your image is compressed anyway though, why are you not using zImage? Cheers ---Dave
On Mon, Oct 01, 2012 at 06:56:47PM +0100, Dave Martin wrote: > > > If the kernel is intended to be loadable at a physical address which is > > > not statically known, no ELF loader that does not ignore the ELF > > > phdr > > > > In this case you can't really use a standard ELF loader to load the > > kernel so, LOAD_OFFSET = 0 is fine. My case is using an ELF loader, > > and I have set options for a static physical load address. > > Generally, people should try to be compatible with the single kernel > image effort unless there's a really compelling reason not to. Well, all the embedded kernels we use are always custom built and minimized for the target. So as long as there are options to minimize the kernel size/increase performance by taking out the relocation stuff, we are going to use them. All our boards on PPC and ARM use DT kernels now, and we try to flow back all the generic stuff as best we can. As someone that makes custom boards, I really like DT, it makes things much easier :) > Wouldn't your firmware be incapable of loading a multiplatform kernel? Well, no, it boots ELFs, so it can boot anything, with any memory layout. A 2nd stage loader would be required to boot standard kernels, that loader would be an ELF with 1 section for the 2nd stage, 1 section for the zImage and 1 section for the initrd, with proper load headers. Creating such a system is a lot of annoyance, so we never have - it is *so much* easier to just boot vmlinux ELF directly. > Well, that was a bit of a pedantic point I admit, but there are > conflicting definitions of what "virtual address" really means in these > situations. The original SYSV ABI spec explicitly specifies that > e_entry is a virtual address, but is also rather vague about how the > paddr fields should be interpreted. Granted the spec is vauge, but convention for loaders seems to be that it is a physical address these days. It could be a virtual address, and the loader could translate it by looking at the phdrs, but I don't see any other arches doing that? > All that AT(ADDR(blah) - LOAD_OFFSET) stuff is cumbersome, but if it's > at least consistent with other architectures then it may not such a > disaster. It's not universal though: less than 50% of the arches in > the kernel currently seem to use this. I agree it is not nice, but I once did try to make PHDRS work as you described, but was never successful. IIRC there were serious linker bugs) As you note the AT method is consistent with other arches, and the generic vmlinux.lds.h > > Yes, if you use more loader stages then the load headers are ignored. > > Our boot loaders on our boards boot straight ELF vmlinux.gz so they > > need correct load headers. > If your image is compressed anyway though, why are you not using zImage? We store the kernel in a CRAMFS, the loader pulls it out and decompresses it, processes the ELF sections 'on the fly' and jumps to it. Using zImage would result in double-decompression, and reallly has no benefits to us. Jason
On Mon, Oct 01, 2012 at 12:35:43PM -0600, Jason Gunthorpe wrote: > On Mon, Oct 01, 2012 at 06:56:47PM +0100, Dave Martin wrote: > > > > > If the kernel is intended to be loadable at a physical address which is > > > > not statically known, no ELF loader that does not ignore the ELF > > > > phdr > > > > > > In this case you can't really use a standard ELF loader to load the > > > kernel so, LOAD_OFFSET = 0 is fine. My case is using an ELF loader, > > > and I have set options for a static physical load address. > > > > Generally, people should try to be compatible with the single kernel > > image effort unless there's a really compelling reason not to. > > Well, all the embedded kernels we use are always custom built and > minimized for the target. So as long as there are options to minimize > the kernel size/increase performance by taking out the relocation > stuff, we are going to use them. > > All our boards on PPC and ARM use DT kernels now, and we try to flow > back all the generic stuff as best we can. As someone that makes > custom boards, I really like DT, it makes things much easier :) > > > Wouldn't your firmware be incapable of loading a multiplatform kernel? > > Well, no, it boots ELFs, so it can boot anything, with any memory > layout. A 2nd stage loader would be required to boot standard kernels, > that loader would be an ELF with 1 section for the 2nd stage, 1 > section for the zImage and 1 section for the initrd, with proper load > headers. Don't you already have to treat Linux as a special case though? How do you know where to load ATAGs, DT and/or initramfs, and how to initalise the registers? None of that is part of any ELF specification, and would be inappropriate if you boot any non-linux images. > Creating such a system is a lot of annoyance, so we never have - it is > *so much* easier to just boot vmlinux ELF directly. > > > Well, that was a bit of a pedantic point I admit, but there are > > conflicting definitions of what "virtual address" really means in these > > situations. The original SYSV ABI spec explicitly specifies that > > e_entry is a virtual address, but is also rather vague about how the > > paddr fields should be interpreted. > > Granted the spec is vauge, but convention for loaders seems to be that > it is a physical address these days. > > It could be a virtual address, and the loader could translate it by > looking at the phdrs, but I don't see any other arches doing that? You would just give .head.text a virtual address matching its load address. But as you say, no other arches bother with this, and it's not obviously worthwhile. > > All that AT(ADDR(blah) - LOAD_OFFSET) stuff is cumbersome, but if it's > > at least consistent with other architectures then it may not such a > > disaster. It's not universal though: less than 50% of the arches in > > the kernel currently seem to use this. > > I agree it is not nice, but I once did try to make PHDRS work as you > described, but was never successful. IIRC there were serious linker > bugs) As you note the AT method is consistent with other arches, and > the generic vmlinux.lds.h A quick experiment shows that PHDRS { kernel PT_LOAD AT(PHYS_OFFSET + LOAD_OFFSET); } /* ... */ SECTIONS { .head.text { /* ... */ } :kernel /* ... */ } can produce a sensible-looking vmlinux at least with my version of the tools. As you observe, GNU ld behaviour in this area tends to be rather patchily specified, buggy or both. That does argue in favour of reusing the same techniques already used for other arches, though. A question does occur to me: do your changes work with XIP_KERNEL? I'm not very familiar with XIP_KERNEL myself, so I'm currently not clear on whether there's an impact here. Beyond this, I think the approach doesn't look unreasonable. > > > Yes, if you use more loader stages then the load headers are ignored. > > > Our boot loaders on our boards boot straight ELF vmlinux.gz so they > > > need correct load headers. > > > If your image is compressed anyway though, why are you not using zImage? > > We store the kernel in a CRAMFS, the loader pulls it out and > decompresses it, processes the ELF sections 'on the fly' and jumps to > it. Using zImage would result in double-decompression, and reallly has > no benefits to us. You store vmlinux.gz in a cramfs? Is that a typo, or have you already compressed the kernel twice? Cheers ---Dave
On Tue, Oct 02, 2012 at 11:23:46AM +0100, Dave Martin wrote: > > Well, no, it boots ELFs, so it can boot anything, with any memory > > layout. A 2nd stage loader would be required to boot standard kernels, > > that loader would be an ELF with 1 section for the 2nd stage, 1 > > section for the zImage and 1 section for the initrd, with proper load > > headers. > > Don't you already have to treat Linux as a special case though? How > do you know where to load ATAGs, DT and/or initramfs, and how to > initalise the registers? None of that is part of any ELF specification, > and would be inappropriate if you boot any non-linux images. Experience with our PPC systems has been that linking the DTB into vmlinux is the way to go - so we have a trivally small, non-upstreamble change (for PPC and ARM) to put the DTB(s) in vmlinux and capture the CPU registers values (entry call function arguments) that are set from the bootloader. Some DTB fixup code (supported by the OF kernel routines) in vmlinux edits the DTB chosen node to include the information from the bootloader. From there any other DTB fixups (eg fetching a MAC address from I2C) plus other register initializations are done in Linux (typically by the upstream drivers, though not 100%), with the MMU on, with full kernel services. This is a much easier development environment :) All in all it is basically about 100 lines of code to do what I've described in vmlinux. This is dramatically less code than would be required if we tried to conform to the standard vmlinux boot protocol. > As you observe, GNU ld behaviour in this area tends to be rather patchily > specified, buggy or both. That does argue in favour of reusing the > same techniques already used for other arches, though. It is good to know PHDRS seems to be working better now, maybe things can migrate someday! > A question does occur to me: do your changes work with XIP_KERNEL? > I'm not very familiar with XIP_KERNEL myself, so I'm currently not > clear on whether there's an impact here. I looked at the XIP defines and they didn't seem to conflict. Since this change only effects the values in the LOAD headers, not the actual image I doubt it has any impact. > Beyond this, I think the approach doesn't look unreasonable. Great, should I do anything further to get this patch into mainline. > You store vmlinux.gz in a cramfs? Is that a typo, or have you already > compressed the kernel twice? The compression can either be intrisic to cramfs or a raw elf that has been gzip'd. Jason
[Nicolas, do you have a view on this thread with regard to XIP?] Hi, On Tue, Oct 02, 2012 at 11:47:59AM -0600, Jason Gunthorpe wrote: > On Tue, Oct 02, 2012 at 11:23:46AM +0100, Dave Martin wrote: > > > > Well, no, it boots ELFs, so it can boot anything, with any memory > > > layout. A 2nd stage loader would be required to boot standard kernels, > > > that loader would be an ELF with 1 section for the 2nd stage, 1 > > > section for the zImage and 1 section for the initrd, with proper load > > > headers. > > > > Don't you already have to treat Linux as a special case though? How > > do you know where to load ATAGs, DT and/or initramfs, and how to > > initalise the registers? None of that is part of any ELF specification, > > and would be inappropriate if you boot any non-linux images. > > Experience with our PPC systems has been that linking the DTB into > vmlinux is the way to go - so we have a trivally small, > non-upstreamble change (for PPC and ARM) to put the DTB(s) in vmlinux > and capture the CPU registers values (entry call function arguments) > that are set from the bootloader. > > Some DTB fixup code (supported by the OF kernel routines) in vmlinux > edits the DTB chosen node to include the information from the > bootloader. > > From there any other DTB fixups (eg fetching a MAC address from I2C) > plus other register initializations are done in Linux (typically by > the upstream drivers, though not 100%), with the MMU on, with full > kernel services. This is a much easier development environment :) > > All in all it is basically about 100 lines of code to do what I've > described in vmlinux. This is dramatically less code than would be > required if we tried to conform to the standard vmlinux boot protocol. [1] I'm not sure exactly what you mean by linking the DTB into vmlinux. I don't think this is supported upstream, at least for ARM. It could be done externally by post-processing vmlinux to add extra sections and some boot shim code which initialises the registers appropriately for kernel entry ... but you're really inventing a bootloader if you start to do that. For ARM, I believe that bootloaders that do not pass the dtb properly are considered considered legacy, or broken (in the case of platforms which are DT-based). We don't really attempt to support this. ARM_APPENDED_DTB is the workaround for that, but that is only available as part of the zImage loader. This is meant to ease migration, not as a permanent solution... but it will probably be available for quite a while yet. Making a simple bootloader DT-aware is not actually that hard: libfdt is intentionally pretty straightforward to integrate. For an example of how this might look in a simple scenario, you could take a look at the zImage loader implementation, or git://git.linaro.org/arm/models/boot-wrapper.git (Particularly update_fdt() and load_kernel() in semi_loader.c) It's more than 100 lines, but not _that_ much more, and contains some implementation features you probably don't need. There may still be an argument against this if your bootloader is exteremely space-constrained. > > As you observe, GNU ld behaviour in this area tends to be rather patchily > > specified, buggy or both. That does argue in favour of reusing the > > same techniques already used for other arches, though. > > It is good to know PHDRS seems to be working better now, maybe things > can migrate someday! I guess we shouldn't rush to do that unnecessarily. Old tools hang around for a long, long time... > > A question does occur to me: do your changes work with XIP_KERNEL? > > I'm not very familiar with XIP_KERNEL myself, so I'm currently not > > clear on whether there's an impact here. > > I looked at the XIP defines and they didn't seem to conflict. Since > this change only effects the values in the LOAD headers, not the > actual image I doubt it has any impact. > > > Beyond this, I think the approach doesn't look unreasonable. > > Great, should I do anything further to get this patch into mainline. I think that someone more familiar with XIP should at least comment on it. Since you admit [1] that you are deliberately not following the boot protocol proper, you may get some pushback about whether the change is justified. From my side, I think that the fact this takes arch/arm/vmlinux.lds.S closer to some other common arches means that this is low risk and should be mostly harmless, but I will take another look at the patch first. > > You store vmlinux.gz in a cramfs? Is that a typo, or have you already > > compressed the kernel twice? > > The compression can either be intrisic to cramfs or a raw elf that has > been gzip'd. If you can have files with intrinsic compression such that they don't get double-compressed, why can you not do this for zImage? Am I still misunderstanding something? Cheers ---Dave
On Wed, Oct 03, 2012 at 11:43:35AM +0100, Dave Martin wrote: > I'm not sure exactly what you mean by linking the DTB into vmlinux. > I don't think this is supported upstream, at least for ARM. It could > be done externally by post-processing vmlinux to add extra sections > and some boot shim code which initialises the registers appropriately > for kernel entry ... but you're really inventing a bootloader if you > start to do that. We use the existing build infrastructure (see cmd_dt_S_dtb) and add a 4 line patch to head.S to put the dtb address into the right register. I think it would be great to have a CONFIG_ARM_BUILT_IN_DTB for applications like mine, but it is really no problem to carry the head.S patch. The normal build process produces a vmlinux that is self contained, no post processing required. > Making a simple bootloader DT-aware is not actually that hard: libfdt > is intentionally pretty straightforward to integrate. Sure, but we have several major issues with that entire idea: 1) Our bootloader (on PPC and ARM) is in redundant NOR flash and is limited to 16K. That is just enough to do gzip, elf, cramfs and tftpv6 - there is no space left for dtb or fdt code. 2) Experience using DT with PPC has shown that the vmlinux is very sensitive to the DTB. Every DTB/vmlinux combination must be tested to ensure it works. By far the best way to manage this *for us* is to guarentee that the vmlinux sees only the DTB it was designed for by including it in vmlinux. 3) The DTB *changes*. We change it because we've changed things in the programmable part of our environment and we change it because the kernel has shifted. Planning for this inevitable change is necessary. 4) On our ARM platforms updating the boot loader in the field is more challenging than updating the vmlinux, not impossible, but not something we want to do every day. Our PPC boot loader has survived without change since 2007, and has booted kernels from 2.6.12 to 3.6 with and without DT. This is a time tested, field proven approach. Fundamentally we treat the DTB like the .config - an integral, internal, part of vmlinux. As far as I can see, the *only* argument to put the DT in the boot loader is to support standard, distribution kernels. > It's more than 100 lines, but not _that_ much more, and contains > some implementation features you probably don't need. But that is not enough, variations of our platforms use bit bang I2C, SPI and NOR FLASH to store things like the MAC address and other DT-relevant information. So, swaths of that code would need to go into the boot loader/zImage as well. At that point we may as well take our chances with uBoot. :) > From my side, I think that the fact this takes arch/arm/vmlinux.lds.S > closer to some other common arches means that this is low risk and should > be mostly harmless, but I will take another look at the patch first. This matches my opinion.. > > > You store vmlinux.gz in a cramfs? Is that a typo, or have you already > > > compressed the kernel twice? > > > > The compression can either be intrisic to cramfs or a raw elf that has > > been gzip'd. > > If you can have files with intrinsic compression such that they don't > get double-compressed, why can you not do this for zImage? > > Am I still misunderstanding something? We have two boot modes, one boots ELF.gz (for development), the other boots ELF in CRAMFS (for deployment). Both are compressed, one is compressed using gzip and one is compressed using zlib in CRAMFS. It is hard to see how to store a zImage in CRAMFS and avoid double de-compression. We never store an ELF.gz in CRAMFS. Since the entire point of zImage is to introduce compression it is hard to see *why* we'd want to store a zImage in a CRAMFS. Jason
On Wed, Oct 03, 2012 at 12:44:38PM -0600, Jason Gunthorpe wrote: > On Wed, Oct 03, 2012 at 11:43:35AM +0100, Dave Martin wrote: > > > I'm not sure exactly what you mean by linking the DTB into vmlinux. > > I don't think this is supported upstream, at least for ARM. It could > > be done externally by post-processing vmlinux to add extra sections > > and some boot shim code which initialises the registers appropriately > > for kernel entry ... but you're really inventing a bootloader if you > > start to do that. > > We use the existing build infrastructure (see cmd_dt_S_dtb) and add a > 4 line patch to head.S to put the dtb address into the right > register. I think it would be great to have a CONFIG_ARM_BUILT_IN_DTB > for applications like mine, but it is really no problem to carry the > head.S patch. The normal build process produces a vmlinux that is self > contained, no post processing required. OK, so it is supported, but not for ARM, yet. I'm not sure that such a patch would be rejected, since building in a DTB is not really that different from building in a command-line. The post-processing that would otherwise be needed is likely to be pretty trivial, though (see example below). > > Making a simple bootloader DT-aware is not actually that hard: libfdt > > is intentionally pretty straightforward to integrate. > > Sure, but we have several major issues with that entire idea: > 1) Our bootloader (on PPC and ARM) is in redundant NOR flash and > is limited to 16K. That is just enough to do gzip, elf, cramfs > and tftpv6 - there is no space left for dtb or fdt code. > 2) Experience using DT with PPC has shown that the vmlinux is very > sensitive to the DTB. Every DTB/vmlinux combination must be tested > to ensure it works. By far the best way to manage this *for us* is > to guarentee that the vmlinux sees only the DTB it was designed > for by including it in vmlinux. > 3) The DTB *changes*. We change it because we've changed things in the > programmable part of our environment and we change it because the > kernel has shifted. Planning for this inevitable change is > necessary. > 4) On our ARM platforms updating the boot loader in the field is more > challenging than updating the vmlinux, not impossible, but not > something we want to do every day. Our PPC boot loader has > survived without change since 2007, and has booted kernels from > 2.6.12 to 3.6 with and without DT. This is a time tested, > field proven approach. > > Fundamentally we treat the DTB like the .config - an integral, > internal, part of vmlinux. > > As far as I can see, the *only* argument to put the DT in the boot > loader is to support standard, distribution kernels. DT-aware does not have to mean that the DTB is part of the bootloader. Generally, that's considered to be a bad idea for the reasons you describe (among others). The other solution to this problem is to distribute the dtb alongside the kernel as a supplementary image (similar to initramfs), with the bootloader doing the bare minimum of modifications to it. Naturally, if zero modifications are needed then the bootloader does not need libfdt (or equivalent code) at all. > > It's more than 100 lines, but not _that_ much more, and contains > > some implementation features you probably don't need. > > But that is not enough, variations of our platforms use bit bang I2C, > SPI and NOR FLASH to store things like the MAC address and other > DT-relevant information. So, swaths of that code would need to go into > the boot loader/zImage as well. > > At that point we may as well take our chances with uBoot. :) > > > From my side, I think that the fact this takes arch/arm/vmlinux.lds.S > > closer to some other common arches means that this is low risk and should > > be mostly harmless, but I will take another look at the patch first. > > This matches my opinion.. > > > > > You store vmlinux.gz in a cramfs? Is that a typo, or have you already > > > > compressed the kernel twice? > > > > > > The compression can either be intrisic to cramfs or a raw elf that has > > > been gzip'd. > > > > If you can have files with intrinsic compression such that they don't > > get double-compressed, why can you not do this for zImage? > > > > Am I still misunderstanding something? > > We have two boot modes, one boots ELF.gz (for development), the other > boots ELF in CRAMFS (for deployment). Both are compressed, one is > compressed using gzip and one is compressed using zlib in CRAMFS. It > is hard to see how to store a zImage in CRAMFS and avoid double > de-compression. We never store an ELF.gz in CRAMFS. OK, now I understand. > Since the entire point of zImage is to introduce compression it is > hard to see *why* we'd want to store a zImage in a CRAMFS. Because zImage has grown one or two extra features (which in fact have nothing to do with compression). But I can see why you're not keen on the idea. In principle we could have an uncompressed zImage, but for now this doesn't exist. One other option open to you would be to provide an ELF image loadable by your bootloader via a simple wrapper. This would probably be a more robust approach than carrying patches against head.S, since it removes any intimate dependency on the kernel code itself. ---8<--- boot.lds.S --- OUTPUT_FORMAT(elf32-littlearm) TARGET(binary) INPUT(Image) INPUT(dtb) TARGET(elf32-littlearm) ENTRY(__entry) SECTIONS { .text PHYS_OFFSET : { boot.o(.text) } kernel PHYS_OFFSET + TEXT_OFFSET : { __kernel_entry = .; Image(*) payload PHYS_OFFSET + PAYLOAD_OFFSET : { __dtb_start = .; dtb(*) } /DISCARD/ : { *(*) } } --- boot.S --- .globl __entry .type __entry, %function __entry: mov r0, #0 mov r1, #0xFFFFFFFF ldr r2, =__dtb_start b __kernel_entry --->8--- Where PHYS_OFFSET and TEXT_OFFSET are the appropriate definitions for your platform, and PAYLOAD_OFFSET is an appropriate safe location for the dtb. The resulting image can be a bit smaller than vmlinux, too (but only slightly). ---Dave
On Thu, Oct 04, 2012 at 12:36:37PM +0100, Dave Martin wrote: > OK, so it is supported, but not for ARM, yet. I'm not sure that > such a patch would be rejected, since building in a DTB is not > really that different from building in a command-line. Maybe I will be able to look at this in a few months.. > The other solution to this problem is to distribute the dtb > alongside the kernel as a supplementary image (similar to > initramfs), with the bootloader doing the bare minimum of > modifications to it. Yes, but we still need rely on complex code like I2C/MTD to create a correct DTB, which again puts us back to patching the kernel for that functionality. > Naturally, if zero modifications are needed then the bootloader does not > need libfdt (or equivalent code) at all. If only :) > One other option open to you would be to provide an ELF image loadable > by your bootloader via a simple wrapper. This would probably be a more > robust approach than carrying patches against head.S, since it removes > any intimate dependency on the kernel code itself. This is what I ment by having a 2nd stage loader, but this example is not complete because at this point you also have to patch into the DTB the initrd address (passed in r0/r1), and the various MAC addresses (getting them is the hard part..). The nice thing about the head.S patch is that it lives right after _entry, so there actually are no dependencies on the kernel code, it executes in the same environment as the example you presented. Anyhow, exploring a CONFIG_ARM_BUILT_IN_DTB.. I think I'd add something like this to head.S: #ifdef CONFIG_ARM_BUILT_IN_DTB adr r3,bootloader_r0 stmia r3,{r0,r1,r2} mov r2, #0 mov r1, #0 mov r0, #0 #else bl __vet_atags #endif And something like this to early_dt_fixup: #ifdef CONFIG_ARM_BUILT_IN_DTB for_each_dt_provider(dtp) { devtree = dtp->get_dtb(); if (devtree) break; } #else devtree = phys_to_virt(dt_phys); #endif Where the 'dev tree provider' would use the stored bootloader registers and any other information to return the proper DTB. Jason
On Mon, Oct 01, 2012 at 10:06:39AM -0600, Jason Gunthorpe wrote: > On Mon, Oct 01, 2012 at 04:39:53PM +0100, Dave Martin wrote: > > > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align > > > LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 > > > > Not related directly to your patch, but I wonder why we don't we see > > separate r-x and rw- segments? > > I think this is because the sections are not aligned when the > protections change, and the sections are not sorted by protection > type. They aren't sorted by protection type, they're ordered according to what's required for the kernel - which is to have the init sections together as one complete block so that it can be freed, and to place all of the kernel text as close to the beginning of the image as possible.
On Fri, Oct 05, 2012 at 09:45:00AM +0100, Russell King - ARM Linux wrote: > On Mon, Oct 01, 2012 at 10:06:39AM -0600, Jason Gunthorpe wrote: > > On Mon, Oct 01, 2012 at 04:39:53PM +0100, Dave Martin wrote: > > > > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align > > > > LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 > > > > > > Not related directly to your patch, but I wonder why we don't we see > > > separate r-x and rw- segments? > > > > I think this is because the sections are not aligned when the > > protections change, and the sections are not sorted by protection > > type. > > They aren't sorted by protection type, they're ordered according to what's > required for the kernel - which is to have the init sections together as > one complete block so that it can be freed, and to place all of the kernel > text as close to the beginning of the image as possible. Ah, right. Partly this came from some side speculation about whether we could do things like privileged read-only permissions on newer CPUs, for preventing unintended or undesired writes to the kernel's code or read-only data. There would be various things to solve in order to make that work, so I guess there's no great urgency for it right now. Cheers ---Dave
On Thu, Oct 04, 2012 at 11:59:07AM -0600, Jason Gunthorpe wrote: > On Thu, Oct 04, 2012 at 12:36:37PM +0100, Dave Martin wrote: > > > OK, so it is supported, but not for ARM, yet. I'm not sure that > > such a patch would be rejected, since building in a DTB is not > > really that different from building in a command-line. > > Maybe I will be able to look at this in a few months.. > > > The other solution to this problem is to distribute the dtb > > alongside the kernel as a supplementary image (similar to > > initramfs), with the bootloader doing the bare minimum of > > modifications to it. > > Yes, but we still need rely on complex code like I2C/MTD to create a > correct DTB, which again puts us back to patching the kernel for that > functionality. I'm still confused as to where this complexity is coming from. If you need to run some complex I2C and MTD code to generate the DT, what is that code doing? If it is probing to get the information, then can you avoid putting the info in the DT at all? Primarily the DT is to describe those aspects of the hardware which can't be probed. Otherwise, you that have a few static configurations: you could have one pre-baked DT per hardware platform, and choose the correct one once you have detected the platform. > > Naturally, if zero modifications are needed then the bootloader does not > > need libfdt (or equivalent code) at all. > > If only :) > > > One other option open to you would be to provide an ELF image loadable > > by your bootloader via a simple wrapper. This would probably be a more > > robust approach than carrying patches against head.S, since it removes > > any intimate dependency on the kernel code itself. > > This is what I ment by having a 2nd stage loader, but this example is > not complete because at this point you also have to patch into the DTB > the initrd address (passed in r0/r1), and the various MAC addresses > (getting them is the hard part..). > > The nice thing about the head.S patch is that it lives right after > _entry, so there actually are no dependencies on the kernel code, it > executes in the same environment as the example you presented. Well, logically there's little difference unless your patch were to be merged. An out-of-tree patch against head.S could be more effort to maintain in some situations, but that code doesn't evolve rapidly. > Anyhow, exploring a CONFIG_ARM_BUILT_IN_DTB.. I think I'd add > something like this to head.S: > > #ifdef CONFIG_ARM_BUILT_IN_DTB > adr r3,bootloader_r0 > stmia r3,{r0,r1,r2} > mov r2, #0 > mov r1, #0 > mov r0, #0 > #else > bl __vet_atags > #endif > > And something like this to early_dt_fixup: > > #ifdef CONFIG_ARM_BUILT_IN_DTB > for_each_dt_provider(dtp) { > devtree = dtp->get_dtb(); > if (devtree) > break; > } > #else > devtree = phys_to_virt(dt_phys); > #endif > > Where the 'dev tree provider' would use the stored bootloader > registers and any other information to return the proper DTB. It would need developing a bit, but something like that might be possible -- it should probably be discussed via devicetree-discuss. If it is doing anything less trivial than picking a pre-baked DT, the rationale would need to be carefully argued. Cheers ---Dave
On Mon, Oct 08, 2012 at 11:24:13AM +0100, Dave Martin wrote: > Partly this came from some side speculation about whether we could do > things like privileged read-only permissions on newer CPUs, for preventing > unintended or undesired writes to the kernel's code or read-only data. Some other arches page protect the kernel, but that tends to be at odds with the desire to use huge pages for the kernel mapping, and independent of the load headers.. Jason
On Mon, Oct 08, 2012 at 11:46:49AM +0100, Dave Martin wrote: > > Yes, but we still need rely on complex code like I2C/MTD to create a > > correct DTB, which again puts us back to patching the kernel for that > > functionality. > > I'm still confused as to where this complexity is coming from. > > If you need to run some complex I2C and MTD code to generate the DT, what > is that code doing? If it is probing to get the information, then can you > avoid putting the info in the DT at all? Primarily the DT is to describe > those aspects of the hardware which can't be probed. At manufacturing the unit is programmed with a small datastructure that contains all the unique ID's (MAC addresses, etc), manufacturing variations (ie vendor A or B was used for a socket) and other ancillary data. So a fair amount of I2C and MTD stack is required just to fetch this structure - a full CFI probe, for instance, is needed in the NOR flash case just to locate the structure. Once read, things like MAC addresses are copied into the DTB, and certain sections of the DTB are NOP'd out - we have stanzas for chip vendor A and vendor B, the one that was not put on the board is replaced with NOP. Similar to the A/B fix, a further fixup is done based on a runtime probe of the programmable devices to learn their current configuration/memory map. It seems desirable to present a complete/correct DTB to the kernel, it doesn't seem there are great places to hook in custom discovery procedures. From a maintenance perspective we already have to test/etc the kernel code for all of this, we don't want to do that twice by duplicating this stuff outside the kernel. > Otherwise, you that have a few static configurations: you could have one > pre-baked DT per hardware platform, and choose the correct one once you > have detected the platform. We do that too, for instance the PPC kernel we build supports 4 different circuit boards, each served by a separate DTB, that needs a fixup pass. I think the biggest DTB describes about 49 devices.. > > Where the 'dev tree provider' would use the stored bootloader > > registers and any other information to return the proper DTB. > > It would need developing a bit, but something like that might be > possible -- it should probably be discussed via devicetree-discuss. > > If it is doing anything less trivial than picking a pre-baked DT, the > rationale would need to be carefully argued. I'm not sure there is a great interest in this? What are other folks working on production embedded stuff doing? I suppose that will be more clear as device tree is rolled out. Jason
On Tue, Oct 09, 2012 at 11:37:06AM -0600, Jason Gunthorpe wrote: > On Mon, Oct 08, 2012 at 11:24:13AM +0100, Dave Martin wrote: > > > Partly this came from some side speculation about whether we could do > > things like privileged read-only permissions on newer CPUs, for preventing > > unintended or undesired writes to the kernel's code or read-only data. > > Some other arches page protect the kernel, but that tends to be at > odds with the desire to use huge pages for the kernel mapping, and > independent of the load headers.. This wasn't so much about that headers themselves as about fragmentation of the page permissions which makes it difficult to map everything using huge pages / sections. But as you say, there are conflicting concerns here, and it seems not to be a priority. Privileged write-protect is nice to have if non-disruptive, but not essential. Cheers ---Dave
On Tue, Oct 09, 2012 at 12:25:14PM -0600, Jason Gunthorpe wrote: > On Mon, Oct 08, 2012 at 11:46:49AM +0100, Dave Martin wrote: > > > > Yes, but we still need rely on complex code like I2C/MTD to create a > > > correct DTB, which again puts us back to patching the kernel for that > > > functionality. > > > > I'm still confused as to where this complexity is coming from. > > > > If you need to run some complex I2C and MTD code to generate the DT, what > > is that code doing? If it is probing to get the information, then can you > > avoid putting the info in the DT at all? Primarily the DT is to describe > > those aspects of the hardware which can't be probed. > > At manufacturing the unit is programmed with a small datastructure that > contains all the unique ID's (MAC addresses, etc), manufacturing > variations (ie vendor A or B was used for a socket) and other > ancillary data. As a side comment, some things such as MAC addresses can be probed and set from userspace after kernel boot, assuming that you have a way to fetch the device description blob from userspace. If it's accessible via MTD I'm guessing you may have some chance of doing that. Identifying the board variant is harder to defer though. > So a fair amount of I2C and MTD stack is required just to fetch this > structure - a full CFI probe, for instance, is needed in the NOR flash > case just to locate the structure. > > Once read, things like MAC addresses are copied into the DTB, and > certain sections of the DTB are NOP'd out - we have stanzas for chip > vendor A and vendor B, the one that was not put on the board is > replaced with NOP. > > Similar to the A/B fix, a further fixup is done based on a runtime > probe of the programmable devices to learn their current > configuration/memory map. > > It seems desirable to present a complete/correct DTB to the kernel, > it doesn't seem there are great places to hook in custom discovery > procedures. > > From a maintenance perspective we already have to test/etc the kernel > code for all of this, we don't want to do that twice by duplicating > this stuff outside the kernel. OK, that gives me a good understanding of what you're trying to achieve here. However, I worry that if you have to run hardware-dependent code in order to fetch or generate parts of the DT, then we have a chicken- and-egg problem with no guaranteed solution with the frameworks that exist today -- although I am not familiar with how DT gets used on all other arches, so you might have counterexamples I'm not aware of. At least in ARM-land, the DT is inherently monolithic and static: the DT is not expected to change once you enter the kernel. Of course, more or less anything can be done with local patches, but merging such functionality in a clean way might be a challenge. > > Otherwise, you that have a few static configurations: you could have one > > pre-baked DT per hardware platform, and choose the correct one once you > > have detected the platform. > > We do that too, for instance the PPC kernel we build supports 4 > different circuit boards, each served by a separate DTB, that needs a > fixup pass. > > I think the biggest DTB describes about 49 devices.. > > > > Where the 'dev tree provider' would use the stored bootloader > > > registers and any other information to return the proper DTB. > > > > It would need developing a bit, but something like that might be > > possible -- it should probably be discussed via devicetree-discuss. > > > > If it is doing anything less trivial than picking a pre-baked DT, the > > rationale would need to be carefully argued. > > I'm not sure there is a great interest in this? What are other folks > working on production embedded stuff doing? I suppose that will be > more clear as device tree is rolled out. For now, the architecturally simplest solution still seems to me to be to write your own boot shim which customises the DT before booting the kernel. As discussed, this can continue to look like a simple ELF image from the bootloader's point of view -- but I appreciate that it will involve effort and some duplication of some low-level driver components. If we could do dynamic DT properly (either fully dynamic, or dynamic in a more constrained way as in your suggested patch), that would provided an architected way to solve this problem from within the kernel. This could certainly be worth raising on devicetree-discuss. Your problem is unlikely to affect people outside the embedded space too much, but it doesn't sound likely to be unique. Cheers ---Dave
On Wed, Oct 10, 2012 at 10:55:44AM +0100, Dave Martin wrote: > As a side comment, some things such as MAC addresses can be probed and > set from userspace after kernel boot, assuming that you have a way > to fetch the device description blob from userspace. If it's accessible > via MTD I'm guessing you may have some chance of doing that. Yes, there is a whole wack of other data in this structure (serial numbers, etc) that are captured and used by userspace, so access is always made possible. Setting the MAC via userspace is interesting, I will bear that in mind. However, we have several development-support boot scenarios that enable the ethernet before the userspace code to fetch the data structure is loaded - one of interest has the kernel's initramfs perform a NBD mount of the actual application filesystem. So having the kernel tend to this is very convenient. > However, I worry that if you have to run hardware-dependent code in > order to fetch or generate parts of the DT, then we have a chicken- > and-egg problem with no guaranteed solution with the frameworks that > exist today -- although I am not familiar with how DT gets used on > all other arches, so you might have counterexamples I'm not aware of. Certainly, in some cases, (like what I have done) such kernels are non-generic and could only work with the hardware they are intended for. However, if you assume the bootloader provides a dtb (even an incomplete one would do) then fixups/replacements/whatever could be run based on tags in the dtb without creating a chicken-and-egg problem. > At least in ARM-land, the DT is inherently monolithic and static: the > DT is not expected to change once you enter the kernel. Sort of, it is the same on all arches, the DT block is expected to remain valid, but all the of_node pointers go directly into the block so you can alter values but not the structure without breaking any invariants. As an example our DTBs all have mac address blocks, with a 0 value. The fixup simply overwrites with the correct value before the ethernet driver reads from it. > > I'm not sure there is a great interest in this? What are other folks > > working on production embedded stuff doing? I suppose that will be > > more clear as device tree is rolled out. > > For now, the architecturally simplest solution still seems to me to be > to write your own boot shim which customises the DT before booting the > kernel. As discussed, this can continue to look like a simple ELF image > from the bootloader's point of view -- but I appreciate that it will > involve effort and some duplication of some low-level driver components. I admit, I am a bit conflicted. Viewing from the kernel perspective, I definitely think this kind of very low level extremely board specific stuff should not be in the kernel. That could be a maintenance nightmare. However, as an OEM, I need the smallest, simplest boot loader possible so I want to push this into the kernel to lighten my development load. > Your problem is unlikely to affect people outside the embedded space > too much, but it doesn't sound likely to be unique. Agreed.. This thread has left the original topic, what do you think I should do with the linker patch? Jason
diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h index 5f6ddcc..4ce5b6d 100644 --- a/arch/arm/include/asm/memory.h +++ b/arch/arm/include/asm/memory.h @@ -283,7 +283,7 @@ static inline __deprecated void *bus_to_virt(unsigned long x) #define arch_is_coherent() 0 #endif -#endif +#endif /* __ASSEMBLY__ */ #include <asm-generic/memory_model.h> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S index 36ff15b..07942b6 100644 --- a/arch/arm/kernel/vmlinux.lds.S +++ b/arch/arm/kernel/vmlinux.lds.S @@ -3,6 +3,13 @@ * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz> */ +/* If we have a known, fixed physical load address then set LOAD_OFFSET + and generate an ELF that has the physical load address in the program + headers. */ +#ifndef CONFIG_ARM_PATCH_PHYS_VIRT +#define LOAD_OFFSET (PAGE_OFFSET - PHYS_OFFSET) +#endif + #include <asm-generic/vmlinux.lds.h> #include <asm/cache.h> #include <asm/thread_info.h> @@ -39,7 +46,7 @@ #endif OUTPUT_ARCH(arm) -ENTRY(stext) +ENTRY(phys_start) #ifndef __ARMEB__ jiffies = jiffies_64; @@ -86,11 +93,13 @@ SECTIONS #else . = PAGE_OFFSET + TEXT_OFFSET; #endif - .head.text : { + .head.text : AT(ADDR(.head.text) - LOAD_OFFSET) { _text = .; + phys_start = . - LOAD_OFFSET; HEAD_TEXT } - .text : { /* Real text segment */ + /* Real text segment */ + .text : AT(ADDR(.text) - LOAD_OFFSET) { _stext = .; /* Text and read-only data */ __exception_text_start = .; *(.exception.text) @@ -119,12 +128,12 @@ SECTIONS * Stack unwinding tables */ . = ALIGN(8); - .ARM.unwind_idx : { + .ARM.unwind_idx : AT(ADDR(.ARM.unwind_idx) - LOAD_OFFSET) { __start_unwind_idx = .; *(.ARM.exidx*) __stop_unwind_idx = .; } - .ARM.unwind_tab : { + .ARM.unwind_tab : AT(ADDR(.ARM.unwind_tab) - LOAD_OFFSET) { __start_unwind_tab = .; *(.ARM.extab*) __stop_unwind_tab = .; @@ -139,35 +148,35 @@ SECTIONS #endif INIT_TEXT_SECTION(8) - .exit.text : { + .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { ARM_EXIT_KEEP(EXIT_TEXT) } - .init.proc.info : { + .init.proc.info : AT(ADDR(.init.proc.info) - LOAD_OFFSET) { ARM_CPU_DISCARD(PROC_INFO) } - .init.arch.info : { + .init.arch.info : AT(ADDR(.init.arch.info) - LOAD_OFFSET) { __arch_info_begin = .; *(.arch.info.init) __arch_info_end = .; } - .init.tagtable : { + .init.tagtable : AT(ADDR(.init.tagtable) - LOAD_OFFSET) { __tagtable_begin = .; *(.taglist.init) __tagtable_end = .; } #ifdef CONFIG_SMP_ON_UP - .init.smpalt : { + .init.smpalt : AT(ADDR(.init.smpalt) - LOAD_OFFSET) { __smpalt_begin = .; *(.alt.smp.init) __smpalt_end = .; } #endif - .init.pv_table : { + .init.pv_table : AT(ADDR(.init.pv_table) - LOAD_OFFSET) { __pv_table_begin = .; *(.pv_table) __pv_table_end = .; } - .init.data : { + .init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { #ifndef CONFIG_XIP_KERNEL INIT_DATA #endif @@ -178,7 +187,7 @@ SECTIONS INIT_RAM_FS } #ifndef CONFIG_XIP_KERNEL - .exit.data : { + .exit.data : AT(ADDR(.exit.data) - LOAD_OFFSET) { ARM_EXIT_KEEP(EXIT_DATA) } #endif @@ -196,7 +205,7 @@ SECTIONS __data_loc = .; #endif - .data : AT(__data_loc) { + .data : AT(__data_loc - LOAD_OFFSET) { _data = .; /* address in memory */ _sdata = .; @@ -245,7 +254,7 @@ SECTIONS * free it after init has commenced and TCM contents have * been copied to its destination. */ - .tcm_start : { + .tcm_start : AT(ADDR(.tcm_start) - LOAD_OFFSET) { . = ALIGN(PAGE_SIZE); __tcm_start = .; __itcm_start = .; @@ -257,7 +266,7 @@ SECTIONS * and we'll upload the contents from RAM to TCM and free * the used RAM after that. */ - .text_itcm ITCM_OFFSET : AT(__itcm_start) + .text_itcm ITCM_OFFSET : AT(__itcm_start - LOAD_OFFSET) { __sitcm_text = .; *(.tcm.text) @@ -272,12 +281,12 @@ SECTIONS */ . = ADDR(.tcm_start) + SIZEOF(.tcm_start) + SIZEOF(.text_itcm); - .dtcm_start : { + .dtcm_start : AT(ADDR(.dtcm_start) - LOAD_OFFSET) { __dtcm_start = .; } /* TODO: add remainder of ITCM as well, that can be used for data! */ - .data_dtcm DTCM_OFFSET : AT(__dtcm_start) + .data_dtcm DTCM_OFFSET : AT(__dtcm_start - LOAD_OFFSET) { . = ALIGN(4); __sdtcm_data = .; @@ -290,7 +299,7 @@ SECTIONS . = ADDR(.dtcm_start) + SIZEOF(.data_dtcm); /* End marker for freeing TCM copy in linked object */ - .tcm_end : AT(ADDR(.dtcm_start) + SIZEOF(.data_dtcm)){ + .tcm_end : AT(ADDR(.dtcm_start) + SIZEOF(.data_dtcm) - LOAD_OFFSET){ . = ALIGN(PAGE_SIZE); __tcm_end = .; }
The standard linux asm-generic/vmlinux.lds.h already supports this, and it seems other architectures do as well. The goal is to create an ELF file that has correct program headers. We want to see the VirtAddr be the runtime address of the kernel with the MMU turned on, and PhysAddr be the physical load address for the section with no MMU. This allows ELF based boot loaders to properly load vmlinux: $ readelf -l vmlinux Entry point 0x8000 Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x008000 0xc0008000 0x00008000 0x372244 0x3a4310 RWE 0x8000 Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> --- arch/arm/include/asm/memory.h | 2 +- arch/arm/kernel/vmlinux.lds.S | 47 ++++++++++++++++++++++++---------------- 2 files changed, 29 insertions(+), 20 deletions(-)