Message ID | 20210527230809.3701-1-Felix.Kuehling@amd.com (mailing list archive) |
---|---|
Series | Support DEVICE_GENERIC memory in migrate_vma_* |

On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
> Now we're trying to migrate data to and from that memory using the
> migrate_vma_* helpers so we can support page-based migration in our
> unified memory allocations, while also supporting CPU access to those
> pages.

So you have completely coherent and indistinguishable GPU and CPU
memory, and the need for migration is basically a lot like a NUMA policy
choice - get better access locality?

> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
> behave correctly in the migrate_vma_* helpers. We are looking for feedback
> about this approach. If we're close, what's needed to make our patches
> acceptable upstream? If we're not close, any suggestions how else to
> achieve what we are trying to do (i.e. page migration and coherent CPU
> access to VRAM)?

I'm not an expert in migrate, but it doesn't look outrageous.

Have you thought about allowing MEMORY_DEVICE_GENERIC to work with
hmm_range_fault() so you can have nice uniform RDMA?

People have wanted to do that with MEMORY_DEVICE_PRIVATE but nobody
finished the work.

Jason

On 2021-05-28 at 9:08 a.m., Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
>> Now we're trying to migrate data to and from that memory using the
>> migrate_vma_* helpers so we can support page-based migration in our
>> unified memory allocations, while also supporting CPU access to those
>> pages.
> So you have completely coherent and indistinguishable GPU and CPU
> memory, and the need for migration is basically a lot like a NUMA policy
> choice - get better access locality?

Yes. For a typical GPU compute application it means the GPU gets the
best bandwidth/latency, and the CPU can coherently access the results
without page faults and migrations. That's especially valuable for
applications with persistent compute kernels that want to exploit
concurrency between CPU and GPU.

>
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
>> behave correctly in the migrate_vma_* helpers. We are looking for feedback
>> about this approach. If we're close, what's needed to make our patches
>> acceptable upstream? If we're not close, any suggestions how else to
>> achieve what we are trying to do (i.e. page migration and coherent CPU
>> access to VRAM)?
> I'm not an expert in migrate, but it doesn't look outrageous.
>
> Have you thought about allowing MEMORY_DEVICE_GENERIC to work with
> hmm_range_fault() so you can have nice uniform RDMA?

Yes. That's our plan for RDMA to unified memory on this system. My
understanding was that DEVICE_GENERIC pages should already work with
hmm_range_fault. But maybe I'm missing something.

>
> People have wanted to do that with MEMORY_DEVICE_PRIVATE but nobody
> finished the work.

Yeah, for DEVICE_PRIVATE it seems more tricky because the peer device is
not the owner of the pages and would need help from the actual owner to
get proper DMA addresses.

Regards,
  Felix

>
> Jason

On 2021-05-29 at 2:41 a.m., Christoph Hellwig wrote:
> On Fri, May 28, 2021 at 11:56:36AM -0400, Felix Kuehling wrote:
>> On 2021-05-28 at 9:08 a.m., Jason Gunthorpe wrote:
>>> On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
>>>> Now we're trying to migrate data to and from that memory using the
>>>> migrate_vma_* helpers so we can support page-based migration in our
>>>> unified memory allocations, while also supporting CPU access to those
>>>> pages.
>>> So you have completely coherent and indistinguishable GPU and CPU
>>> memory, and the need for migration is basically a lot like a NUMA policy
>>> choice - get better access locality?
>> Yes. For a typical GPU compute application it means the GPU gets the
>> best bandwidth/latency, and the CPU can coherently access the results
>> without page faults and migrations. That's especially valuable for
>> applications with persistent compute kernels that want to exploit
>> concurrency between CPU and GPU.
> So why not expose the GPU memory as a CPUless memory node?

We did consider this, and are in fact still considering it for future
systems. For this system we decided not to go that way for several
reasons.

For one, it means the driver would need to allocate VRAM with
__alloc_pages_nodemask for its own needs (firmware blobs, page tables,
etc.) and for the traditional BO-based memory allocation APIs. The GPU
driver would compete for VRAM with other application allocations, such
as malloc, mmap, page cache etc. Benchmarking and optimizing the NUMA
policy for such a system with a wide variety of workloads would be a big
effort.

All VRAM would need to be 0-initialized at allocation time. (I know
about init_on_free=1. In fact that's what our GPU driver does for VRAM
today, but asynchronously to hide the latency. However, init_on_free is
synchronous and has other drawbacks for system memory according to the
documentation of config INIT_ON_FREE_DEFAULT_ON.)

To make virtualization work, GPU access to its own local VRAM would need
to go through the system IOMMU.

Regards,
  Felix