
[RFC,0/5] Support DEVICE_GENERIC memory in migrate_vma_*

Message ID: 20210527230809.3701-1-Felix.Kuehling@amd.com

Message

Felix Kuehling May 27, 2021, 11:08 p.m. UTC
AMD is building a system architecture for the Frontier supercomputer with
a coherent interconnect between CPUs and GPUs. This hardware architecture
allows the CPUs to coherently access GPU device memory. We have hardware
in our labs and we are working with our partner HPE on the BIOS, firmware
and software for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver
looks it up with lookup_resource and registers it with devmap as
MEMORY_DEVICE_GENERIC using devm_memremap_pages.
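
For reference, the registration boils down to roughly the sketch below. This
is simplified and not the literal amdgpu code; register_vram_as_generic and
svm_migrate_pgmap_ops are placeholder names.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/ioport.h>
#include <linux/memremap.h>

/* Placeholder for the driver's dev_pagemap_ops; DEVICE_GENERIC has no
 * mandatory callbacks, so this may stay empty. */
static const struct dev_pagemap_ops svm_migrate_pgmap_ops;

static int register_vram_as_generic(struct device *dev,
                                    struct dev_pagemap *pgmap,
                                    resource_size_t vram_base)
{
        struct resource *res;
        void *addr;

        /* Find the SPM range the BIOS advertised in the UEFI memory map. */
        res = lookup_resource(&iomem_resource, vram_base);
        if (!res)
                return -ENOENT;

        pgmap->type = MEMORY_DEVICE_GENERIC;
        pgmap->nr_range = 1;
        pgmap->range.start = res->start;
        pgmap->range.end = res->end;
        pgmap->ops = &svm_migrate_pgmap_ops;
        pgmap->owner = dev;             /* driver-private owner cookie */

        /* Create struct pages for the whole VRAM range. */
        addr = devm_memremap_pages(dev, pgmap);
        return IS_ERR(addr) ? PTR_ERR(addr) : 0;
}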

Now we're trying to migrate data to and from that memory using the
migrate_vma_* helpers so we can support page-based migration in our
unified memory allocations, while also supporting CPU access to those
pages.
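
To sketch what that looks like on the migrate-to-VRAM side (following the
migrate_vma API as of v5.13; svm_alloc_vram_page is a placeholder and the
data copy is elided):

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

struct page *svm_alloc_vram_page(void); /* placeholder: allocates a VRAM page */

/* Sketch: migrate a VMA range from system memory into DEVICE_GENERIC
 * VRAM pages. Error unwinding and the actual data copy are elided. */
static int svm_migrate_range_to_vram(struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end,
                                     void *pgmap_owner)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        struct migrate_vma migrate = {
                .vma = vma,
                .start = start,
                .end = end,
                .flags = MIGRATE_VMA_SELECT_SYSTEM,
                .pgmap_owner = pgmap_owner,
        };
        unsigned long i;
        int ret;

        /* One buffer for both the src and dst pfn arrays. */
        migrate.src = kvcalloc(npages * 2, sizeof(*migrate.src), GFP_KERNEL);
        if (!migrate.src)
                return -ENOMEM;
        migrate.dst = migrate.src + npages;

        ret = migrate_vma_setup(&migrate);
        if (ret)
                goto out;

        for (i = 0; i < npages; i++) {
                struct page *dpage;

                if (!(migrate.src[i] & MIGRATE_PFN_MIGRATE))
                        continue;

                dpage = svm_alloc_vram_page();
                if (!dpage)
                        continue;       /* leave this page in system memory */
                lock_page(dpage);
                /* Real code copies the source page contents to dpage here. */
                migrate.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
                                 MIGRATE_PFN_LOCKED;
        }

        migrate_vma_pages(&migrate);
        migrate_vma_finalize(&migrate);
out:
        kvfree(migrate.src);
        return ret;
}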

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
behave correctly in the migrate_vma_* helpers. We are looking for feedback
about this approach. If we're close, what's needed to make our patches
acceptable upstream? If we're not close, any suggestions how else to
achieve what we are trying to do (i.e. page migration and coherent CPU
access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently
upstreamed to Dave Airlie's drm-next branch
[https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we
reworked our VRAM management for migrations to remove some incorrect
assumptions and to allow partially successful migrations, as well as GPU
memory mappings that mix pages in VRAM and system memory.
[https://patchwork.kernel.org/project/dri-devel/list/?series=489811]

In this RFC, patches 1 and 2 are for context to show how we are looking up
the SPM memory and registering it with devmap.

Patches 3-5 are the changes we are trying to upstream or rework to make
them acceptable upstream.

Alex Sierra (5):
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helper to check zone device generic type
  mm: add generic type support for device zone page migration
  mm: changes to unref pages with Generic type

 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 +++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |  1 -
 include/linux/mm.h                       |  8 ++++++++
 kernel/resource.c                        |  2 +-
 mm/memremap.c                            |  5 ++++-
 mm/migrate.c                             | 13 ++++++++-----
 6 files changed, 32 insertions(+), 12 deletions(-)

Comments

Jason Gunthorpe May 28, 2021, 1:08 p.m. UTC | #1
On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
> Now we're trying to migrate data to and from that memory using the
> migrate_vma_* helpers so we can support page-based migration in our
> unified memory allocations, while also supporting CPU access to those
> pages.

So you have completely coherent and indistinguishable GPU and CPU
memory and the need for migration is basically a lot like NUMA policy
choice - get better access locality?
 
> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
> behave correctly in the migrate_vma_* helpers. We are looking for feedback
> about this approach. If we're close, what's needed to make our patches
> acceptable upstream? If we're not close, any suggestions how else to
> achieve what we are trying to do (i.e. page migration and coherent CPU
> access to VRAM)?

I'm not an expert in migrate, but it doesn't look outrageous.

Have you thought about allowing MEMORY_DEVICE_GENERIC to work with
hmm_range_fault() so you can have nice uniform RDMA?

People have wanted to do that with MEMORY_DEVICE_PRIVATE, but nobody
finished the work.

Jason
Felix Kuehling May 28, 2021, 3:56 p.m. UTC | #2
Am 2021-05-28 um 9:08 a.m. schrieb Jason Gunthorpe:
> On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
>> Now we're trying to migrate data to and from that memory using the
>> migrate_vma_* helpers so we can support page-based migration in our
>> unified memory allocations, while also supporting CPU access to those
>> pages.
> So you have completely coherent and indistinguishable GPU and CPU
> memory and the need for migration is basically a lot like NUMA policy
> choice - get better access locality?

Yes. For a typical GPU compute application it means the GPU gets the
best bandwidth/latency, and the CPU can coherently access the results
without page faults and migrations. That's especially valuable for
applications with persistent compute kernels that want to exploit
concurrency between CPU and GPU.


>  
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages
>> behave correctly in the migrate_vma_* helpers. We are looking for feedback
>> about this approach. If we're close, what's needed to make our patches
>> acceptable upstream? If we're not close, any suggestions how else to
>> achieve what we are trying to do (i.e. page migration and coherent CPU
>> access to VRAM)?
> I'm not an expert in migrate, but it doesn't look outrageous.
>
> Have you thought about allowing MEMORY_DEVICE_GENERIC to work with
> hmm_range_fault() so you can have nice uniform RDMA?

Yes. That's our plan for RDMA to unified memory on this system. My
understanding was that DEVICE_GENERIC pages should already work with
hmm_range_fault, but maybe I'm missing something.
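
For reference, the usual pattern from Documentation/vm/hmm.rst looks roughly
like the sketch below; "notifier" and "driver_lock" stand in for the driver's
interval notifier and page-table lock, and hmm_pfns needs one slot per page
in the range.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

static int fault_and_map_range(struct mm_struct *mm,
                               struct mmu_interval_notifier *notifier,
                               struct mutex *driver_lock,
                               unsigned long start, unsigned long end,
                               unsigned long *hmm_pfns)
{
        struct hmm_range range = {
                .notifier = notifier,
                .start = start,
                .end = end,
                .hmm_pfns = hmm_pfns,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        do {
                range.notifier_seq = mmu_interval_read_begin(notifier);
                mmap_read_lock(mm);
                ret = hmm_range_fault(&range);
                mmap_read_unlock(mm);
                if (ret) {
                        if (ret == -EBUSY)
                                continue;       /* raced with invalidation, retry */
                        return ret;
                }

                mutex_lock(driver_lock);
                if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
                        mutex_unlock(driver_lock);
                        continue;               /* snapshot went stale, retry */
                }
                /* Snapshot is stable under driver_lock: map
                 * hmm_pfn_to_page(hmm_pfns[i]) into the device or RDMA page
                 * tables here; DEVICE_GENERIC VRAM pages come back as
                 * ordinary struct pages, just like system memory. */
                mutex_unlock(driver_lock);
                return 0;
        } while (true);
}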


>
> People have wanted to do that with MEMORY_DEVICE_PRIVATE, but nobody
> finished the work.

Yeah, for DEVICE_PRIVATE it seems more tricky because the peer device is
not the owner of the pages and would need help from the actual owner to
get proper DMA addresses.

Regards,
  Felix


>
> Jason
Christoph Hellwig May 29, 2021, 6:41 a.m. UTC | #3
On Fri, May 28, 2021 at 11:56:36AM -0400, Felix Kuehling wrote:
> Am 2021-05-28 um 9:08 a.m. schrieb Jason Gunthorpe:
> > On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
> >> Now we're trying to migrate data to and from that memory using the
> >> migrate_vma_* helpers so we can support page-based migration in our
> >> unified memory allocations, while also supporting CPU access to those
> >> pages.
> > So you have completely coherent and indistinguishable GPU and CPU
> > memory and the need for migration is basically a lot like NUMA policy
> > choice - get better access locality?
> 
> Yes. For a typical GPU compute application it means the GPU gets the
> best bandwidth/latency, and the CPU can coherently access the results
> without page faults and migrations. That's especially valuable for
> applications with persistent compute kernels that want to exploit
> concurrency between CPU and GPU.

So why not expose the GPU memory as a CPUless memory node?
Felix Kuehling May 29, 2021, 6:37 p.m. UTC | #4
Am 2021-05-29 um 2:41 a.m. schrieb Christoph Hellwig:
> On Fri, May 28, 2021 at 11:56:36AM -0400, Felix Kuehling wrote:
>> Am 2021-05-28 um 9:08 a.m. schrieb Jason Gunthorpe:
>>> On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
>>>> Now we're trying to migrate data to and from that memory using the
>>>> migrate_vma_* helpers so we can support page-based migration in our
>>>> unified memory allocations, while also supporting CPU access to those
>>>> pages.
>>> So you have completely coherent and indistinguishable GPU and CPU
>>> memory and the need for migration is basically a lot like NUMA policy
>>> choice - get better access locality?
>> Yes. For a typical GPU compute application it means the GPU gets the
>> best bandwidth/latency, and the CPU can coherently access the results
>> without page faults and migrations. That's especially valuable for
>> applications with persistent compute kernels that want to exploit
>> concurrency between CPU and GPU.
> So why not expose the GPU memory as a CPUless memory node?

We did consider this, and are in fact still considering it for future
systems. For this system we decided not to go that way for several reasons.

For one, it means the driver would need to allocate VRAM with
__alloc_pages_nodemask both for its own needs (firmware blobs, page tables,
etc.) and for its traditional BO-based memory allocation APIs. The GPU
driver would then compete for VRAM with other allocations, such as
application malloc, mmap and the page cache. Benchmarking and optimizing
the NUMA policy for such a system across a wide variety of workloads would
be a big effort.

All VRAM would need to be 0-initialized at allocation time. (I know
about init_on_free=1. In fact that's what our GPU driver does for VRAM
today, but asynchronously to hide the latency. However, init_on_free is
synchronous and has other drawbacks for system memory according to the
documentation of config INIT_ON_FREE_DEFAULT_ON.)

To make virtualization work, GPU access to its own local VRAM would need
to go through the system IOMMU.

Regards,
  Felix