Message ID | 20240709002757.2431399-1-scott@os.amperecomputing.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] resource: limit request_free_mem_region based on arch_get_mappable_range | expand |
D Scott Phillips wrote: > On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order > of struct page size to dimension region"), the amdgpu driver could trip > over the warning of: > > `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));` > > in vmemmap_populate()[1]. After that commit, it becomes a translation fault > and panic[2]. > > The cause is that the amdgpu driver allocates some unused space from > iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and > devm_memremap_pages() it. An address above those backed by the arm64 > vmemmap is picked. > > Limit request_free_mem_region() so that only addresses within the > arch_get_mappable_range() can be chosen as device private addresses. It seems odd that devm_request_free_mem_region() needs to be careful about this restriction. The caller passes in the resource tree that is the bounds of valid address ranges. This change assumes that the caller wants to be restricted to vmemmap capable address ranges beyond the restrictions it already requested in the passed in @base argument. That restriction may be true with respect to request_mem_region(), but not necessarily other users of get_free_mem_region() like alloc_free_mem_region(). So, 2 questions / change request options: 1/ Preferred: Is there a possibility for the AMD driver to trim the resource it is passing to be bound by arch_get_mappable_range()? For CXL this is achieved by inserting CXL aperture windows into the resource tree. In the future what happens in the MEMORY_DEVICE_PUBLIC case when the memory address is picked by a hardware aperture on the device? It occurs to me if that aperture is communicated to the device via some platform mechanism (to honor arch_get_mappable_range() restrictions), then maybe the same should be done here. I have always cringed at the request_free_mem_region() implementation playing fast and loose with the platform memory map. Maybe this episode is a sign that these constraints need more formal handling in the resource tree. I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform communicated aperture rather than hoping that unused portions of iomem_resource can be repurposed like this. 2/ If option 1/ proves too difficult, can you rework the consideration of @base to be gated by @desc? Something like: if (desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) base_start = max(base->start, arch_get_mappable_range().start); base_end = min(base->end, arch_get_mappable_range().end; else base_start = base->start; base_end = base->end; ...to localize the consideration of arch_get_mappable_range() only to the places it is absolutely required?
diff --git a/kernel/resource.c b/kernel/resource.c index fcbca39dbc450..6f256aa0191b4 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1832,25 +1832,28 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size, if (flags & GFR_DESCENDING) { resource_size_t end; - end = min_t(resource_size_t, base->end, + end = min3(base->end, arch_get_mappable_range().end, (1ULL << MAX_PHYSMEM_BITS) - 1); return end - size + 1; } - return ALIGN(base->start, align); + return ALIGN(max_t(resource_size_t, base->start, + arch_get_mappable_range().start), align); } static bool gfr_continue(struct resource *base, resource_size_t addr, resource_size_t size, unsigned long flags) { + if (flags & GFR_DESCENDING) - return addr > size && addr >= base->start; + return addr > size && addr >= base->start && + addr >= arch_get_mappable_range().start; /* * In the ascend case be careful that the last increment by * @size did not wrap 0. */ return addr > addr - size && - addr <= min_t(resource_size_t, base->end, + addr <= min3(base->end, arch_get_mappable_range().end, (1ULL << MAX_PHYSMEM_BITS) - 1); }
On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order of struct page size to dimension region"), the amdgpu driver could trip over the warning of: `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));` in vmemmap_populate()[1]. After that commit, it becomes a translation fault and panic[2]. The cause is that the amdgpu driver allocates some unused space from iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and devm_memremap_pages() it. An address above those backed by the arm64 vmemmap is picked. Limit request_free_mem_region() so that only addresses within the arch_get_mappable_range() can be chosen as device private addresses. [1]: Call trace: vmemmap_populate+0x30/0x48 __populate_section_memmap+0x40/0x90 sparse_add_section+0xfc/0x3e8 __add_pages+0xb4/0x168 pagemap_range+0x300/0x410 memremap_pages+0x184/0x2d8 devm_memremap_pages+0x30/0x90 kgd2kfd_init_zone_device+0xe0/0x1f0 [amdgpu] amdgpu_device_ip_init+0x674/0x888 [amdgpu] amdgpu_device_init+0x7bc/0xed8 [amdgpu] amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu] amdgpu_pci_probe+0x194/0x580 [amdgpu] local_pci_probe+0x48/0xb8 work_for_cpu_fn+0x24/0x40 process_one_work+0x170/0x3e0 worker_thread+0x2ac/0x3e0 kthread+0xf4/0x108 ret_from_fork+0x10/0x20 [2]: Unable to handle kernel paging request at virtual address 000001ffa6000034 Mem abort info: ESR = 0x0000000096000044 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000 CM = 0, WnR = 1, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000 [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000 Call trace: __init_zone_device_page.constprop.0+0x2c/0xa8 memmap_init_zone_device+0xf0/0x210 pagemap_range+0x1e0/0x410 memremap_pages+0x18c/0x2e0 devm_memremap_pages+0x30/0x90 kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu] amdgpu_device_ip_init+0x674/0x888 [amdgpu] amdgpu_device_init+0x7a4/0xea0 [amdgpu] amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu] amdgpu_pci_probe+0x1a0/0x560 [amdgpu] local_pci_probe+0x48/0xb8 work_for_cpu_fn+0x24/0x40 process_one_work+0x170/0x3e0 worker_thread+0x2ac/0x3e0 kthread+0xf4/0x108 ret_from_fork+0x10/0x20 Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com> --- Link to v1: https://lore.kernel.org/all/20240703210707.1986816-1-scott@os.amperecomputing.com/ Changes since v1: - Change from fiddling the architecture's MAX_PHYSMEM_BITS to checking arch_get_mappable_range(). kernel/resource.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)