
[v2] resource: limit request_free_mem_region based on arch_get_mappable_range

Message ID 20240709002757.2431399-1-scott@os.amperecomputing.com (mailing list archive)
State New

Commit Message

D Scott Phillips July 9, 2024, 12:27 a.m. UTC
On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
of struct page size to dimension region"), the amdgpu driver could trip
over the warning of:

`WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`

in vmemmap_populate()[1]. After that commit, it becomes a translation fault
and panic[2].

The cause is that the amdgpu driver allocates some unused space from
iomem_resource, claims it as MEMORY_DEVICE_PRIVATE, and maps it with
devm_memremap_pages(). The allocator can pick an address above the range
backed by the arm64 vmemmap.

Limit request_free_mem_region() so that only addresses within the
arch_get_mappable_range() can be chosen as device private addresses.

[1]: Call trace:
      vmemmap_populate+0x30/0x48
      __populate_section_memmap+0x40/0x90
      sparse_add_section+0xfc/0x3e8
      __add_pages+0xb4/0x168
      pagemap_range+0x300/0x410
      memremap_pages+0x184/0x2d8
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xe0/0x1f0 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7bc/0xed8 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x194/0x580 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

[2]: Unable to handle kernel paging request at virtual address
             000001ffa6000034
     Mem abort info:
       ESR = 0x0000000096000044
       EC = 0x25: DABT (current EL), IL = 32 bits
       SET = 0, FnV = 0
       EA = 0, S1PTW = 0
       FSC = 0x04: level 0 translation fault
     Data abort info:
       ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000
       CM = 0, WnR = 1, TnD = 0, TagAccess = 0
       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
     user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000
     [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000
     Call trace:
      __init_zone_device_page.constprop.0+0x2c/0xa8
      memmap_init_zone_device+0xf0/0x210
      pagemap_range+0x1e0/0x410
      memremap_pages+0x18c/0x2e0
      devm_memremap_pages+0x30/0x90
      kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu]
      amdgpu_device_ip_init+0x674/0x888 [amdgpu]
      amdgpu_device_init+0x7a4/0xea0 [amdgpu]
      amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
      amdgpu_pci_probe+0x1a0/0x560 [amdgpu]
      local_pci_probe+0x48/0xb8
      work_for_cpu_fn+0x24/0x40
      process_one_work+0x170/0x3e0
      worker_thread+0x2ac/0x3e0
      kthread+0xf4/0x108
      ret_from_fork+0x10/0x20

Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
---
Link to v1: https://lore.kernel.org/all/20240703210707.1986816-1-scott@os.amperecomputing.com/
Changes since v1:
 - Change from fiddling the architecture's MAX_PHYSMEM_BITS to checking
   arch_get_mappable_range().

 kernel/resource.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Dan Williams July 9, 2024, 1:31 a.m. UTC | #1
D Scott Phillips wrote:
> On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
> of struct page size to dimension region"), the amdgpu driver could trip
> over the warning of:
> 
> `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`
> 
> in vmemmap_populate()[1]. After that commit, it becomes a translation fault
> and panic[2].
> 
> The cause is that the amdgpu driver allocates some unused space from
> iomem_resource, claims it as MEMORY_DEVICE_PRIVATE, and maps it with
> devm_memremap_pages(). The allocator can pick an address above the range
> backed by the arm64 vmemmap.
> 
> Limit request_free_mem_region() so that only addresses within the
> arch_get_mappable_range() can be chosen as device private addresses.

It seems odd that devm_request_free_mem_region() needs to be careful
about this restriction. The caller passes in the resource tree that is
the bounds of valid address ranges. This change assumes that the caller
wants to be restricted to vmemmap capable address ranges beyond the
restrictions it already requested in the passed in @base argument. That
restriction may be true with respect to request_mem_region(), but not
necessarily other users of get_free_mem_region() like
alloc_free_mem_region().

So, 2 questions / change request options:

1/ Preferred: Is there a possibility for the AMD driver to trim the
resource it is passing to be bound by arch_get_mappable_range()? For CXL
this is achieved by inserting CXL aperture windows into the resource
tree.

In the future, what happens in the MEMORY_DEVICE_PUBLIC case when the
memory address is picked by a hardware aperture on the device? It occurs
to me that if that aperture is communicated to the device via some
platform mechanism (to honor arch_get_mappable_range() restrictions),
then maybe the same should be done here.

I have always cringed at the request_free_mem_region() implementation
playing fast and loose with the platform memory map. Maybe this episode
is a sign that these constraints need more formal handling in the
resource tree.

I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform communicated
aperture rather than hoping that unused portions of iomem_resource can
be repurposed like this.


2/ If option 1/ proves too difficult, can you rework the consideration
of @base to be gated by @desc?

Something like:

if (desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
	base_start = max(base->start, arch_get_mappable_range().start);
	base_end = min(base->end, arch_get_mappable_range().end);
} else {
	base_start = base->start;
	base_end = base->end;
}

...to localize the consideration of arch_get_mappable_range() only to
the places it is absolutely required?

Patch

diff --git a/kernel/resource.c b/kernel/resource.c
index fcbca39dbc450..6f256aa0191b4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1832,25 +1832,28 @@  static resource_size_t gfr_start(struct resource *base, resource_size_t size,
 	if (flags & GFR_DESCENDING) {
 		resource_size_t end;
 
-		end = min_t(resource_size_t, base->end,
+		end = min3(base->end, arch_get_mappable_range().end,
 			    (1ULL << MAX_PHYSMEM_BITS) - 1);
 		return end - size + 1;
 	}
 
-	return ALIGN(base->start, align);
+	return ALIGN(max_t(resource_size_t, base->start,
+			   arch_get_mappable_range().start), align);
 }
 
 static bool gfr_continue(struct resource *base, resource_size_t addr,
 			 resource_size_t size, unsigned long flags)
 {
+
 	if (flags & GFR_DESCENDING)
-		return addr > size && addr >= base->start;
+		return addr > size && addr >= base->start &&
+		       addr >= arch_get_mappable_range().start;
 	/*
 	 * In the ascend case be careful that the last increment by
 	 * @size did not wrap 0.
 	 */
 	return addr > addr - size &&
-	       addr <= min_t(resource_size_t, base->end,
+	       addr <= min3(base->end, arch_get_mappable_range().end,
 			     (1ULL << MAX_PHYSMEM_BITS) - 1);
 }