Message ID | 1405363248.25580.12.camel@deneb.redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Jul 14, 2014 at 02:40:48PM -0400, Mark Salter wrote: > On Mon, 2014-07-14 at 17:25 +0200, Ard Biesheuvel wrote: > > If we fail to relocate the kernel Image to its preferred offset of TEXT_OFFSET > > bytes above the base of DRAM, accept the lowest alternative mapping available > > instead of aborting. We may lose a bit of memory at the low end, but we can > > still proceed normally otherwise. > > This breaks APM Mustang because the spin-table holding pen for secondary > CPUs is marked as reserved memory in the TEXT_OFFSET area and the kernel > placement using your patch makes it unreachable by kernel. Here is a > patch I've been working with to solve the same problem: Hmm. The thing I don't like about the below approach is that it hard wires in the "memory below TEXT_OFFSET cannot be used" aspect, beyond the current prectical limitation. Since we are likely to see platforms with UEFI memory in use around start of RAM, that is a limitation we should probably try to get rid of. > From: Mark Salter <msalter@redhat.com> > Date: Thu, 10 Jul 2014 09:25:30 -0400 > Subject: [PATCH] arm64/efi: try to handle firmware located below kernel > > The rule for arm64 kernel image placement is that it must be located > TEXT_OFFSET bytes past a 2MiB boundary. The kernel itself will use the > TEXT_OFFSET sized area for initial page tables but that area is not > part of the kernel image itself. > > The current EFI stub code finds the base of DRAM from the EFI memmap > and relocates the kernel to dram_base+TEXT_OFFSET. This assumes that > the low memory is not being used and the kernel relocation simply > fails if the base memory allocation fails. > > At least one vendor has firmware which occupies memory near dram_base > so kernel relocations always fail. This patch attempts to work with > such firmware by searching the EFI memmap for the lowest available > memory which may be used for the kernel image. There are several > pitfalls remaining which may lead to boot failure: > > * The stub does not allocate the TEXT_OFFSET region, so it is > required that the firmware not be using that area for anything > which may interfere or overlap with the initial kernel page > tables. We can't simply include that area in our search for > available memory because firmware using the spin-table method > for booting secondary CPUs may place the CPU pen in an out of > the way part of that region and mark it as reserved memory. > > * The current code requires FDT to be placed within first 512MiB > of DRAM (with the kernel below it). This requirement can be > removed in the future, but would involve changes to generic > stub code shared by other architectures. > > Signed-off-by: Mark Salter <msalter@redhat.com> > --- > arch/arm64/kernel/efi-stub.c | 45 +++++++++++++++++++++++++++++++++++++------- > arch/arm64/kernel/efi.c | 19 ++++++++++++++++--- > 2 files changed, 54 insertions(+), 10 deletions(-) > > diff --git a/arch/arm64/kernel/efi-stub.c b/arch/arm64/kernel/efi-stub.c > index 60e98a63..f5da27f 100644 > --- a/arch/arm64/kernel/efi-stub.c > +++ b/arch/arm64/kernel/efi-stub.c > @@ -54,21 +54,53 @@ static efi_status_t handle_kernel_image(efi_system_table_t *sys_table, > efi_loaded_image_t *image) > { > efi_status_t status; > - unsigned long kernel_size, kernel_memsize = 0; > + unsigned long kernel_size, kernel_memsize; > + unsigned long desired_base = dram_base + TEXT_OFFSET; > + unsigned long desired_end; > + unsigned long map_size; > + struct efi_memory_map map; > + efi_memory_desc_t *md; > > /* Relocate the image, if required. */ > kernel_size = _edata - _text; > - if (*image_addr != (dram_base + TEXT_OFFSET)) { > - kernel_memsize = kernel_size + (_end - _edata); > + kernel_memsize = kernel_size + (_end - _edata); > + > + desired_end = desired_base + kernel_size; > + > + /* find lowest available address for kernel to live */ > + status = efi_get_memory_map(sys_table, (efi_memory_desc_t **)&map.map, > + &map_size, &map.desc_size, NULL, NULL); > + if (status == EFI_SUCCESS) { > + map.map_end = map.map + map_size; > + for_each_efi_memory_desc(&map, md) { > + unsigned long start, end, offset; > + if (!(md->attribute & EFI_MEMORY_WB)) > + continue; > + if (md->type != EFI_CONVENTIONAL_MEMORY) > + continue; > + start = md->phys_addr; > + end = start + (md->num_pages << EFI_PAGE_SHIFT); > + offset = start & (SZ_2M - 1); > + if (offset < TEXT_OFFSET) > + start += (TEXT_OFFSET - offset); > + else if (offset > TEXT_OFFSET) > + start = ALIGN(start, SZ_2M) + TEXT_OFFSET; > + if (start < end && (start + kernel_memsize) <= end) { > + desired_base = start; > + break; > + } > + } > + } I think this would be useful as a helper function - efi_alloc_above() or something. > + > + if (*image_addr != desired_base) { > status = efi_relocate_kernel(sys_table, image_addr, > kernel_size, kernel_memsize, > - dram_base + TEXT_OFFSET, > - PAGE_SIZE); > + desired_base, PAGE_SIZE); > if (status != EFI_SUCCESS) { > pr_efi_err(sys_table, "Failed to relocate kernel\n"); > return status; > } > - if (*image_addr != (dram_base + TEXT_OFFSET)) { > + if (*image_addr != desired_base) { > pr_efi_err(sys_table, "Failed to alloc kernel memory\n"); > efi_free(sys_table, kernel_memsize, *image_addr); > return EFI_ERROR; > @@ -76,6 +108,5 @@ static efi_status_t handle_kernel_image(efi_system_table_t *sys_table, > *image_size = kernel_memsize; > } > > - > return EFI_SUCCESS; > } > diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c > index 453b7f8..2bc6469 100644 > --- a/arch/arm64/kernel/efi.c > +++ b/arch/arm64/kernel/efi.c > @@ -180,9 +180,18 @@ static __init void reserve_regions(void) > if (is_reserve_region(md) || > md->type == EFI_BOOT_SERVICES_CODE || > md->type == EFI_BOOT_SERVICES_DATA) { > - memblock_reserve(paddr, size); > - if (uefi_debug) > - pr_cont("*"); > + if (paddr < PHYS_OFFSET) { > + if ((paddr + size) > PHYS_OFFSET) { > + size -= (PHYS_OFFSET - paddr); > + memblock_reserve(PHYS_OFFSET, size); > + if (uefi_debug) > + pr_cont("**"); > + } > + } else { > + memblock_reserve(paddr, size); > + if (uefi_debug) > + pr_cont("*"); > + } > } > > if (uefi_debug) > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > continue; > } > > + /* Don't free anything below kernel */ > + if (md->phys_addr < PHYS_OFFSET) > + continue; > + Is the spin table area really allocated as BOOT_SERVICES_*? / Leif
On Tue, Jul 15, 2014 at 11:02:22AM +0100, Leif Lindholm wrote: > On Mon, Jul 14, 2014 at 02:40:48PM -0400, Mark Salter wrote: > > On Mon, 2014-07-14 at 17:25 +0200, Ard Biesheuvel wrote: > > > If we fail to relocate the kernel Image to its preferred offset of TEXT_OFFSET > > > bytes above the base of DRAM, accept the lowest alternative mapping available > > > instead of aborting. We may lose a bit of memory at the low end, but we can > > > still proceed normally otherwise. > > > > This breaks APM Mustang because the spin-table holding pen for secondary > > CPUs is marked as reserved memory in the TEXT_OFFSET area and the kernel > > placement using your patch makes it unreachable by kernel. Here is a > > patch I've been working with to solve the same problem: > > Hmm. The thing I don't like about the below approach is that it hard > wires in the "memory below TEXT_OFFSET cannot be used" aspect, beyond > the current prectical limitation. > > Since we are likely to see platforms with UEFI memory in use around > start of RAM, that is a limitation we should probably try to get rid of. This isn't just an issue for UEFI. There are other reasons one might want to load a kernel away from the start of RAM while still wanting to address said RAM(e.g. kdump). We should address that. [...] > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > continue; > > } > > > > + /* Don't free anything below kernel */ > > + if (md->phys_addr < PHYS_OFFSET) > > + continue; > > + > > Is the spin table area really allocated as BOOT_SERVICES_*? If that is the case, this platform is _broken_. The spin-table memory (both the code and the mailboxes) needs to live around forever in case you don't boot all of the secondaries. Thanks, Mark.
On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > continue; > > } > > > > + /* Don't free anything below kernel */ > > + if (md->phys_addr < PHYS_OFFSET) > > + continue; > > + > > Is the spin table area really allocated as BOOT_SERVICES_*? > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, then there could be BS code/data memory which we'd want to ignore.
On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > continue; > > > } > > > > > > + /* Don't free anything below kernel */ > > > + if (md->phys_addr < PHYS_OFFSET) > > > + continue; > > > + > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > then there could be BS code/data memory which we'd want to ignore. Well, if it is boot service code/data - then there is no need for us to keep it around after ExitBootServices(). / Leif
On Tue, 2014-07-15 at 14:54 +0100, Leif Lindholm wrote: > On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > > continue; > > > > } > > > > > > > > + /* Don't free anything below kernel */ > > > > + if (md->phys_addr < PHYS_OFFSET) > > > > + continue; > > > > + > > > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > > then there could be BS code/data memory which we'd want to ignore. > > Well, if it is boot service code/data - then there is no need for us > to keep it around after ExitBootServices(). > > / > Leif One would think, but EFI has proven to be less than strictly compliant in that regard in the past. I'm inclined to keep the boot services around until after SetVirtualAddressMap just in case.
On Tue, Jul 15, 2014 at 03:23:26PM +0100, Mark Salter wrote: > On Tue, 2014-07-15 at 14:54 +0100, Leif Lindholm wrote: > > On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > > > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > > > continue; > > > > > } > > > > > > > > > > + /* Don't free anything below kernel */ > > > > > + if (md->phys_addr < PHYS_OFFSET) > > > > > + continue; > > > > > + > > > > > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > > > > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > > > then there could be BS code/data memory which we'd want to ignore. > > > > Well, if it is boot service code/data - then there is no need for us > > to keep it around after ExitBootServices(). > > > > / > > Leif > > One would think, but EFI has proven to be less than strictly compliant > in that regard in the past. I'm inclined to keep the boot services > around until after SetVirtualAddressMap just in case. Why should we add a work around for a potential bug that doesn't exist yet? That just provides fertile ground for such a bug to spring into existence and for people to ignore it when bringing up their SoC. The comment doesn't explain the rationale and the code doesn't make sense given a sane implementation. For the moment it's better to be strict, IMO. Otherwise there are plenty of other potential bugs we could attempt to work around to enable people to write firmware with even lower standards... If we have to work around something then we should have an actual issue to work around first. Thanks, Mark.
On Tue, Jul 15, 2014 at 10:23:26AM -0400, Mark Salter wrote: > On Tue, 2014-07-15 at 14:54 +0100, Leif Lindholm wrote: > > On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > > > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > > > continue; > > > > > } > > > > > > > > > > + /* Don't free anything below kernel */ > > > > > + if (md->phys_addr < PHYS_OFFSET) > > > > > + continue; > > > > > + > > > > > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > > > > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > > > then there could be BS code/data memory which we'd want to ignore. > > > > Well, if it is boot service code/data - then there is no need for us > > to keep it around after ExitBootServices(). > > One would think, but EFI has proven to be less than strictly compliant > in that regard in the past. I'm inclined to keep the boot services > around until after SetVirtualAddressMap just in case. But the function you add this clause to will still throw away all boot services code/data regions - just with this modification it skips those that happen to lie lower in the address space than the kernel. (And I do agree with Mark R here - let's not work around bugs that don't exist yet.) / Leif
On Tue, 2014-07-15 at 15:49 +0100, Leif Lindholm wrote: > On Tue, Jul 15, 2014 at 10:23:26AM -0400, Mark Salter wrote: > > On Tue, 2014-07-15 at 14:54 +0100, Leif Lindholm wrote: > > > On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > > > > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > > > > continue; > > > > > > } > > > > > > > > > > > > + /* Don't free anything below kernel */ > > > > > > + if (md->phys_addr < PHYS_OFFSET) > > > > > > + continue; > > > > > > + > > > > > > > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > > > > > > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > > > > then there could be BS code/data memory which we'd want to ignore. > > > > > > Well, if it is boot service code/data - then there is no need for us > > > to keep it around after ExitBootServices(). > > > > One would think, but EFI has proven to be less than strictly compliant > > in that regard in the past. I'm inclined to keep the boot services > > around until after SetVirtualAddressMap just in case. > > But the function you add this clause to will still throw away all boot > services code/data regions - just with this modification it skips > those that happen to lie lower in the address space than the kernel. > > (And I do agree with Mark R here - let's not work around bugs that > don't exist yet.) > I'm not sure if they still exist or not, but on Foundation, I saw a crash in SetVirtualAddressMap unless I kept BS regions around.
On Tue, Jul 15, 2014 at 11:04:37AM -0400, Mark Salter wrote: > On Tue, 2014-07-15 at 15:49 +0100, Leif Lindholm wrote: > > On Tue, Jul 15, 2014 at 10:23:26AM -0400, Mark Salter wrote: > > > On Tue, 2014-07-15 at 14:54 +0100, Leif Lindholm wrote: > > > > On Tue, Jul 15, 2014 at 09:11:00AM -0400, Mark Salter wrote: > > > > > On Tue, 2014-07-15 at 11:02 +0100, Leif Lindholm wrote: > > > > > > > @@ -273,6 +282,10 @@ static void __init free_boot_services(void) > > > > > > > continue; > > > > > > > } > > > > > > > > > > > > > > + /* Don't free anything below kernel */ > > > > > > > + if (md->phys_addr < PHYS_OFFSET) > > > > > > > + continue; > > > > > > > + > > > > > > > > > > > > Is the spin table area really allocated as BOOT_SERVICES_*? > > > > > > > > > > No. It is EFI_RESERVED_TYPE. But if UEFI is allowed below the kernel, > > > > > then there could be BS code/data memory which we'd want to ignore. > > > > > > > > Well, if it is boot service code/data - then there is no need for us > > > > to keep it around after ExitBootServices(). > > > > > > One would think, but EFI has proven to be less than strictly compliant > > > in that regard in the past. I'm inclined to keep the boot services > > > around until after SetVirtualAddressMap just in case. > > > > But the function you add this clause to will still throw away all boot > > services code/data regions - just with this modification it skips > > those that happen to lie lower in the address space than the kernel. Returning to the actual code we are discussing here: The hunk above has no bearing on whether boot services regions are generally unmapped or not. It only filters explicitly those boot services regions that happen to be lower in memory than the kernel, and keep them around for the duration of the system. > > > > (And I do agree with Mark R here - let's not work around bugs that > > don't exist yet.) > > > > I'm not sure if they still exist or not, but on Foundation, I saw a > crash in SetVirtualAddressMap unless I kept BS regions around. For the topic of keeping boot services code around: I did also see issues with not keeping boot services regions on v7 - ages ago. I have not seen it this year, and I _really_ want to see if any such issues resurface. So post-3.16 I would quite like to see the call to free_boot_services() moved earlier to flush out any such issues before we see large-scale deployments. The foundation model is a development tool, not a production system, so any issue reproducable only there is a pure bonus - since that firmware can _always_ be updated. / Leif
diff --git a/arch/arm64/kernel/efi-stub.c b/arch/arm64/kernel/efi-stub.c index 60e98a63..f5da27f 100644 --- a/arch/arm64/kernel/efi-stub.c +++ b/arch/arm64/kernel/efi-stub.c @@ -54,21 +54,53 @@ static efi_status_t handle_kernel_image(efi_system_table_t *sys_table, efi_loaded_image_t *image) { efi_status_t status; - unsigned long kernel_size, kernel_memsize = 0; + unsigned long kernel_size, kernel_memsize; + unsigned long desired_base = dram_base + TEXT_OFFSET; + unsigned long desired_end; + unsigned long map_size; + struct efi_memory_map map; + efi_memory_desc_t *md; /* Relocate the image, if required. */ kernel_size = _edata - _text; - if (*image_addr != (dram_base + TEXT_OFFSET)) { - kernel_memsize = kernel_size + (_end - _edata); + kernel_memsize = kernel_size + (_end - _edata); + + desired_end = desired_base + kernel_size; + + /* find lowest available address for kernel to live */ + status = efi_get_memory_map(sys_table, (efi_memory_desc_t **)&map.map, + &map_size, &map.desc_size, NULL, NULL); + if (status == EFI_SUCCESS) { + map.map_end = map.map + map_size; + for_each_efi_memory_desc(&map, md) { + unsigned long start, end, offset; + if (!(md->attribute & EFI_MEMORY_WB)) + continue; + if (md->type != EFI_CONVENTIONAL_MEMORY) + continue; + start = md->phys_addr; + end = start + (md->num_pages << EFI_PAGE_SHIFT); + offset = start & (SZ_2M - 1); + if (offset < TEXT_OFFSET) + start += (TEXT_OFFSET - offset); + else if (offset > TEXT_OFFSET) + start = ALIGN(start, SZ_2M) + TEXT_OFFSET; + if (start < end && (start + kernel_memsize) <= end) { + desired_base = start; + break; + } + } + } + + if (*image_addr != desired_base) { status = efi_relocate_kernel(sys_table, image_addr, kernel_size, kernel_memsize, - dram_base + TEXT_OFFSET, - PAGE_SIZE); + desired_base, PAGE_SIZE); if (status != EFI_SUCCESS) { pr_efi_err(sys_table, "Failed to relocate kernel\n"); return status; } - if (*image_addr != (dram_base + TEXT_OFFSET)) { + if (*image_addr != desired_base) { pr_efi_err(sys_table, "Failed to alloc kernel memory\n"); efi_free(sys_table, kernel_memsize, *image_addr); return EFI_ERROR; @@ -76,6 +108,5 @@ static efi_status_t handle_kernel_image(efi_system_table_t *sys_table, *image_size = kernel_memsize; } - return EFI_SUCCESS; } diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c index 453b7f8..2bc6469 100644 --- a/arch/arm64/kernel/efi.c +++ b/arch/arm64/kernel/efi.c @@ -180,9 +180,18 @@ static __init void reserve_regions(void) if (is_reserve_region(md) || md->type == EFI_BOOT_SERVICES_CODE || md->type == EFI_BOOT_SERVICES_DATA) { - memblock_reserve(paddr, size); - if (uefi_debug) - pr_cont("*"); + if (paddr < PHYS_OFFSET) { + if ((paddr + size) > PHYS_OFFSET) { + size -= (PHYS_OFFSET - paddr); + memblock_reserve(PHYS_OFFSET, size); + if (uefi_debug) + pr_cont("**"); + } + } else { + memblock_reserve(paddr, size); + if (uefi_debug) + pr_cont("*"); + } } if (uefi_debug) @@ -273,6 +282,10 @@ static void __init free_boot_services(void) continue; } + /* Don't free anything below kernel */ + if (md->phys_addr < PHYS_OFFSET) + continue; + /* * We want to free memory from this region. */