Message ID | 20240426154138.64643-2-lyude@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] drm/nouveau/firmware: Fix SG_DEBUG error with nvkm_firmware_ctor() |
> Currently, this can result in runtime PM issues on systems where memory
> Luckily, we don't actually need to allocate coherent memory for the page
> table thanks to being able to pass the GPU a radix3 page table for
> suspend/resume data. So, let's rewrite nvkm_gsp_radix3_sg() to use the sg
> allocator for level 2. We continue using coherent allocations for lvl0 and
> 1, since they only take a single page.
>
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h |  4 +-
>  .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c    | 71 ++++++++++++-------
>  2 files changed, 47 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> index 6f5d376d8fcc1..a11d16a16c3b2 100644
> --- a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> +++ b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> @@ -15,7 +15,9 @@ struct nvkm_gsp_mem {
>  };
>
>  struct nvkm_gsp_radix3 {
> -       struct nvkm_gsp_mem mem[3];
> +       struct nvkm_gsp_mem lvl0;
> +       struct nvkm_gsp_mem lvl1;
> +       struct sg_table lvl2;

This looks great, could we go a step further and combine lvl0 and lvl1
into a 2 page allocation, I thought we could combine lvl0/lvl1 into a
2 page alloc, but that actually might be a bad idea under memory
pressure.

Dave.
On Mon, 2024-04-29 at 16:03 +1000, Dave Airlie wrote:
> > Currently, this can result in runtime PM issues on systems where memory
> > Luckily, we don't actually need to allocate coherent memory for the page
> > table thanks to being able to pass the GPU a radix3 page table for
> > suspend/resume data. So, let's rewrite nvkm_gsp_radix3_sg() to use the sg
> > allocator for level 2. We continue using coherent allocations for lvl0 and
> > 1, since they only take a single page.
> >
> > Signed-off-by: Lyude Paul <lyude@redhat.com>
> > Cc: stable@vger.kernel.org
> > ---
> >  .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h |  4 +-
> >  .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c    | 71 ++++++++++++-------
> >  2 files changed, 47 insertions(+), 28 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> > index 6f5d376d8fcc1..a11d16a16c3b2 100644
> > --- a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> > +++ b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> > @@ -15,7 +15,9 @@ struct nvkm_gsp_mem {
> >  };
> >
> >  struct nvkm_gsp_radix3 {
> > -       struct nvkm_gsp_mem mem[3];
> > +       struct nvkm_gsp_mem lvl0;
> > +       struct nvkm_gsp_mem lvl1;
> > +       struct sg_table lvl2;
>
> This looks great, could we go a step further and combine lvl0 and lvl1
> into a 2 page allocation, I thought we could combine lvl0/lvl1 into a
> 2 page alloc, but that actually might be a bad idea under memory
> pressure.

I'm not sure I understand :P, do we want to go for that or not?

TBH - I'm not sure there's any hardware reason we wouldn't be able to do
the whole radix3 table as an sg allocation with two additional memory
pages added on for level 0 and 1 - since both of those can only be the
size of a single page anyway it probably doesn't make much of a
difference. The main reason I didn't end up doing that though is because
it would make the codepath in nvkm_gsp_radix3_sg() a lot uglier.

We need the virtual addresses of level 0-2's first/only pages to populate
them, and we also need the DMA addresses of level 1-2. There isn't an
iterator that lets you go through both DMA/virtual addresses as far as I
can tell - and even if there was we'd start having to keep track of when
we reach the end of a page in the loop and make sure that we always set
pte to the address of the third sg page on the first iteration of the
loop. IMO, scatterlist could definitely benefit from having an iterator
that does both and can be stepped through both in and out of for loop
macros (like Iterator in rust).

So - it's definitely possible, but considering:

 * nvkm_gsp_mem isn't a very big struct
 * We're only allocating a single page for level 0 and 1, so at least
   according to the advice I got from Sima this should be a safe amount
   to allocate coherently under memory pressure.
 * It's just a lot easier code-wise having direct access to the DMA/virt
   addresses for the first two levels

I decided to stay with nvkm_gsp_mem_ctor() for the first two pages and
just use nvkm_gsp_sg() for the rest. I can definitely convert the whole
thing to using nvkm_gsp_sg() if we really want though - but I don't think
it'll give us much benefit.

I'll send out the new version of the patch without these changes and a
fix for one of the issues with this patch I already mentioned to Timur,
just let me know what you end up deciding and I can revise the patch if
you want.

>
> Dave.
>
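To make the three levels discussed above concrete, here is a minimal user-space sketch of the table layout. It is not the nouveau code: `model_page`, `model_radix3`, the fake allocator, and the 4096-byte page size are all hypothetical names and assumptions made for illustration. Each "page" carries a CPU pointer and a made-up bus address, which mirrors why the reply wants both kinds of address at hand. Level 0 holds one entry (the bus address of level 1), level 1 holds one entry per level-2 page, and the level-2 pages, which need not be contiguous with each other, hold one entry per data page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */
#define MODEL_PTES_PER_PAGE (MODEL_PAGE_SIZE / sizeof(uint64_t))

/* Hypothetical stand-in for one allocated page: CPU pointer plus bus address. */
struct model_page {
	uint64_t *virt;
	uint64_t dma;
};

struct model_radix3 {
	struct model_page lvl0;		/* one page, one used entry */
	struct model_page lvl1;		/* one page, one entry per level-2 page */
	struct model_page *lvl2;	/* pages need not be contiguous */
	size_t nr_lvl2;
};

/* Fake allocator: zeroed pages with made-up, increasing bus addresses. */
static int model_page_alloc(struct model_page *p)
{
	static uint64_t next_dma = 0x100000;

	p->virt = calloc(1, MODEL_PAGE_SIZE);
	if (!p->virt)
		return -1;
	p->dma = next_dma;
	next_dma += MODEL_PAGE_SIZE;
	return 0;
}

static void model_radix3_free(struct model_radix3 *rx3)
{
	for (size_t i = 0; rx3->lvl2 && i < rx3->nr_lvl2; i++)
		free(rx3->lvl2[i].virt);
	free(rx3->lvl2);
	free(rx3->lvl1.virt);
	free(rx3->lvl0.virt);
}

/* Build a table describing nr_data pages whose bus addresses are data_dma[]. */
static int model_radix3_build(struct model_radix3 *rx3,
			      const uint64_t *data_dma, size_t nr_data)
{
	size_t nr_lvl2 = (nr_data + MODEL_PTES_PER_PAGE - 1) / MODEL_PTES_PER_PAGE;

	*rx3 = (struct model_radix3){0};
	if (nr_data == 0 || nr_lvl2 > MODEL_PTES_PER_PAGE)
		return -1;	/* level 1 is a single page in this model */

	rx3->lvl2 = calloc(nr_lvl2, sizeof(*rx3->lvl2));
	if (!rx3->lvl2)
		goto fail;
	rx3->nr_lvl2 = nr_lvl2;
	if (model_page_alloc(&rx3->lvl0) || model_page_alloc(&rx3->lvl1))
		goto fail;
	for (size_t i = 0; i < nr_lvl2; i++)
		if (model_page_alloc(&rx3->lvl2[i]))
			goto fail;

	/* Level 0 points at level 1. */
	rx3->lvl0.virt[0] = rx3->lvl1.dma;

	/* Level 1 points at every level-2 page. */
	for (size_t i = 0; i < nr_lvl2; i++)
		rx3->lvl1.virt[i] = rx3->lvl2[i].dma;

	/* Level 2 points at every data page; a full page rolls over to the next. */
	for (size_t i = 0; i < nr_data; i++)
		rx3->lvl2[i / MODEL_PTES_PER_PAGE].virt[i % MODEL_PTES_PER_PAGE] =
			data_dma[i];

	return 0;
fail:
	model_radix3_free(rx3);
	return -1;
}

/* Walk the table through the CPU-side pointers and return the data address. */
static uint64_t model_radix3_lookup(const struct model_radix3 *rx3, size_t idx)
{
	size_t l2 = idx / MODEL_PTES_PER_PAGE;

	assert(l2 < rx3->nr_lvl2);
	assert(rx3->lvl0.virt[0] == rx3->lvl1.dma);
	assert(rx3->lvl1.virt[l2] == rx3->lvl2[l2].dma);
	return rx3->lvl2[l2].virt[idx % MODEL_PTES_PER_PAGE];
}
```

In this model the index arithmetic decides when to move to the next level-2 page, which is the bookkeeping the reply describes as awkward to do with scatterlist iterators. Calling `model_radix3_build()` with 1000 data pages yields two level-2 pages, and `model_radix3_lookup()` returns the original address for any index.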
diff --git a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
index 6f5d376d8fcc1..a11d16a16c3b2 100644
--- a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
+++ b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
@@ -15,7 +15,9 @@ struct nvkm_gsp_mem {
 };
 
 struct nvkm_gsp_radix3 {
-	struct nvkm_gsp_mem mem[3];
+	struct nvkm_gsp_mem lvl0;
+	struct nvkm_gsp_mem lvl1;
+	struct sg_table lvl2;
 };
 
 int nvkm_gsp_sg(struct nvkm_device *, u64 size, struct sg_table *);
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
index 9858c1438aa7f..2bf9077d37118 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
@@ -1624,7 +1624,7 @@ r535_gsp_wpr_meta_init(struct nvkm_gsp *gsp)
 	meta->magic = GSP_FW_WPR_META_MAGIC;
 	meta->revision = GSP_FW_WPR_META_REVISION;
 
-	meta->sysmemAddrOfRadix3Elf = gsp->radix3.mem[0].addr;
+	meta->sysmemAddrOfRadix3Elf = gsp->radix3.lvl0.addr;
 	meta->sizeOfRadix3Elf = gsp->fb.wpr2.elf.size;
 
 	meta->sysmemAddrOfBootloader = gsp->boot.fw.addr;
@@ -1919,8 +1919,9 @@ nvkm_gsp_sg(struct nvkm_device *device, u64 size, struct sg_table *sgt)
 static void
 nvkm_gsp_radix3_dtor(struct nvkm_gsp *gsp, struct nvkm_gsp_radix3 *rx3)
 {
-	for (int i = ARRAY_SIZE(rx3->mem) - 1; i >= 0; i--)
-		nvkm_gsp_mem_dtor(gsp, &rx3->mem[i]);
+	nvkm_gsp_sg_free(gsp->subdev.device, &rx3->lvl2);
+	nvkm_gsp_mem_dtor(gsp, &rx3->lvl1);
+	nvkm_gsp_mem_dtor(gsp, &rx3->lvl0);
 }
 
 /**
@@ -1960,36 +1961,52 @@ static int
 nvkm_gsp_radix3_sg(struct nvkm_gsp *gsp, struct sg_table *sgt, u64 size,
 		   struct nvkm_gsp_radix3 *rx3)
 {
-	u64 addr;
+	struct sg_dma_page_iter sg_dma_iter;
+	struct scatterlist *sg;
+	size_t bufsize;
+	u64 *pte;
+	int ret, i, page_idx = 0;
 
-	for (int i = ARRAY_SIZE(rx3->mem) - 1; i >= 0; i--) {
-		u64 *ptes;
-		size_t bufsize;
-		int ret, idx;
-
-		bufsize = ALIGN((size / GSP_PAGE_SIZE) * sizeof(u64), GSP_PAGE_SIZE);
-		ret = nvkm_gsp_mem_ctor(gsp, bufsize, &rx3->mem[i]);
-		if (ret)
-			return ret;
+	ret = nvkm_gsp_mem_ctor(gsp, GSP_PAGE_SIZE, &rx3->lvl0);
+	if (ret)
+		return ret;
 
-		ptes = rx3->mem[i].data;
-		if (i == 2) {
-			struct scatterlist *sgl;
+	ret = nvkm_gsp_mem_ctor(gsp, GSP_PAGE_SIZE, &rx3->lvl1);
+	if (ret)
+		goto lvl1_fail;
 
-			for_each_sgtable_dma_sg(sgt, sgl, idx) {
-				for (int j = 0; j < sg_dma_len(sgl) / GSP_PAGE_SIZE; j++)
-					*ptes++ = sg_dma_address(sgl) + (GSP_PAGE_SIZE * j);
-			}
-		} else {
-			for (int j = 0; j < size / GSP_PAGE_SIZE; j++)
-				*ptes++ = addr + GSP_PAGE_SIZE * j;
+	// Allocate level 2
+	bufsize = ALIGN((size / GSP_PAGE_SIZE) * sizeof(u64), GSP_PAGE_SIZE);
+	ret = nvkm_gsp_sg(gsp->subdev.device, bufsize, &rx3->lvl2);
+	if (ret)
+		goto lvl2_fail;
+
+	// Write the bus address of level 1 to level 0
+	pte = rx3->lvl0.data;
+	*pte = rx3->lvl1.addr;
+
+	// Write the bus address of each page in level 2 to level 1
+	pte = rx3->lvl1.data;
+	for_each_sgtable_dma_page(&rx3->lvl2, &sg_dma_iter, 0)
+		*pte++ = sg_page_iter_dma_address(&sg_dma_iter);
+
+	// Finally, write the bus address of each page in sgt to level 2
+	for_each_sgtable_sg(&rx3->lvl2, sg, i) {
+		pte = sg_virt(sg);
+		for_each_sgtable_dma_page(sgt, &sg_dma_iter, page_idx) {
+			*pte++ = sg_page_iter_dma_address(&sg_dma_iter);
+			page_idx++;
 		}
+	}
 
-		size = rx3->mem[i].size;
-		addr = rx3->mem[i].addr;
+	if (ret) {
+lvl2_fail:
+		nvkm_gsp_mem_dtor(gsp, &rx3->lvl1);
+lvl1_fail:
+		nvkm_gsp_mem_dtor(gsp, &rx3->lvl0);
 	}
 
-	return 0;
+	return ret;
 }
 
 int
@@ -2021,7 +2038,7 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 		sr = gsp->sr.meta.data;
 		sr->magic = GSP_FW_SR_META_MAGIC;
 		sr->revision = GSP_FW_SR_META_REVISION;
-		sr->sysmemAddrOfSuspendResumeData = gsp->sr.radix3.mem[0].addr;
+		sr->sysmemAddrOfSuspendResumeData = gsp->sr.radix3.lvl0.addr;
 		sr->sizeOfSuspendResumeData = len;
 
 		mbox0 = lower_32_bits(gsp->sr.meta.addr);
Currently we allocate all 3 levels of radix3 page tables using
nvkm_gsp_mem_ctor(), which uses dma_alloc_coherent() for allocating all of
the relevant memory. This can end up failing in scenarios where the system
has very high memory fragmentation, and we can't find enough contiguous
memory to allocate level 2 of the page table.

Currently, this can result in runtime PM issues on systems where memory
fragmentation is high - as we'll fail to allocate the page table for our
suspend/resume buffer:

  kworker/10:2: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
  nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 10 PID: 479809 Comm: kworker/10:2 Not tainted
  6.8.6-201.ChopperV6.fc39.x86_64 #1
  Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
  Workqueue: pm pm_runtime_work
  Call Trace:
   <TASK>
   dump_stack_lvl+0x64/0x80
   warn_alloc+0x165/0x1e0
   ? __alloc_pages_direct_compact+0xb3/0x2b0
   __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
   __alloc_pages+0x32d/0x350
   __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
   dma_direct_alloc+0x70/0x270
   nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
   r535_gsp_fini+0x1d4/0x350 [nouveau]
   nvkm_subdev_fini+0x67/0x150 [nouveau]
   nvkm_device_fini+0x95/0x1e0 [nouveau]
   nvkm_udevice_fini+0x53/0x70 [nouveau]
   nvkm_object_fini+0xb9/0x240 [nouveau]
   nvkm_object_fini+0x75/0x240 [nouveau]
   nouveau_do_suspend+0xf5/0x280 [nouveau]
   nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
   pci_pm_runtime_suspend+0x67/0x1e0
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   __rpm_callback+0x41/0x170
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_callback+0x5d/0x70
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_suspend+0x120/0x6a0
   pm_runtime_work+0x98/0xb0
   process_one_work+0x171/0x340
   worker_thread+0x27b/0x3a0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xe5/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30

Luckily, we don't actually need to allocate coherent memory for the page
table thanks to being able to pass the GPU a radix3 page table for
suspend/resume data. So, let's rewrite nvkm_gsp_radix3_sg() to use the sg
allocator for level 2. We continue using coherent allocations for lvl0 and
1, since they only take a single page.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
---
 .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h |  4 +-
 .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c    | 71 ++++++++++++-------
 2 files changed, 47 insertions(+), 28 deletions(-)
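
The sizes involved can be checked with a little arithmetic. The sketch below is a rough illustration, not driver code: it assumes 4 KiB pages for both the CPU and GSP_PAGE_SIZE, and the helper names (`lvl2_bytes`, `order_for`, `lvl1_max_data_bytes`) are hypothetical. `lvl2_bytes()` reproduces the ALIGN() expression from the patch, `order_for()` gives the smallest power-of-two page count that covers a contiguous request, and `lvl1_max_data_bytes()` shows how much data a single level-1 page can describe. The thread does not state the actual suspend/resume buffer size; 256 MiB is used in the note below only because it yields an order:7 request like the one in the trace.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ull	/* assumption: 4 KiB CPU and GSP pages */

/* Bytes of level-2 table needed for `size` bytes of data (one u64 per page). */
static uint64_t lvl2_bytes(uint64_t size)
{
	uint64_t raw = (size / PAGE_SIZE_ASSUMED) * sizeof(uint64_t);

	/* Round up to a whole page, as ALIGN(raw, GSP_PAGE_SIZE) does. */
	return (raw + PAGE_SIZE_ASSUMED - 1) / PAGE_SIZE_ASSUMED * PAGE_SIZE_ASSUMED;
}

/* Smallest order such that 2^order contiguous pages cover `bytes`. */
static unsigned int order_for(uint64_t bytes)
{
	uint64_t pages = (bytes + PAGE_SIZE_ASSUMED - 1) / PAGE_SIZE_ASSUMED;
	unsigned int order = 0;

	while ((1ull << order) < pages)
		order++;
	return order;
}

/* Most data a single level-1 page can describe through level 2. */
static uint64_t lvl1_max_data_bytes(void)
{
	uint64_t ptes = PAGE_SIZE_ASSUMED / sizeof(uint64_t);	/* 512 entries */

	/* 512 level-2 pages, each with 512 entries, each mapping one page. */
	return ptes * ptes * PAGE_SIZE_ASSUMED;
}
```

Under these assumptions, 256 MiB of data needs a 512 KiB level-2 table, which as one contiguous block is 128 pages (order 7). A single level-1 page can describe up to 1 GiB of data, which is consistent with levels 0 and 1 staying single-page coherent allocations while level 2 moves to the sg allocator.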