
[v3,00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

Message ID 20220110223201.31024-1-alex.sierra@amd.com (mailing list archive)

Message

Sierra Guiza, Alejandro (Alex) Jan. 10, 2022, 10:31 p.m. UTC
This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

Christoph, the suggestion to incorporate Ralph Campbell’s refcount
cleanup patch into our hardware page migration patchset originally came
from you, but it proved impractical to do things in that order because
the refcount cleanup introduced a bug with wide ranging structural
implications. Instead, we amended Ralph’s patch so that it could be
applied after merging the migration work. As we saw from the recent
discussion, merging the refcount work is going to take some time and
cooperation between multiple development groups, while the migration
work is ready now and is needed now. So we propose to merge this
patchset first and continue to work with Ralph and others to merge the
refcount cleanup separately, when it is ready.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.
System stability and performance are not affected according to our
ongoing testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.

The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
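
For readers less familiar with devm_memremap_pages(), here is a minimal,
hypothetical sketch of the kind of registration described above. This is
not the amdgpu code from this series; the names my_page_free() and
my_register_coherent_vram(), and the use of a struct resource to describe
the SPM range, are illustrative assumptions only.

#include <linux/ioport.h>	/* struct resource */
#include <linux/memremap.h>

/* Return the backing VRAM page to the driver's own allocator. */
static void my_page_free(struct page *page)
{
}

static const struct dev_pagemap_ops my_pgmap_ops = {
	.page_free = my_page_free,
};

static int my_register_coherent_vram(struct device *dev, struct resource *res,
				     struct dev_pagemap *pgmap)
{
	void *addr;

	pgmap->type = MEMORY_DEVICE_COHERENT;	/* new type added by this series */
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;
	pgmap->ops = &my_pgmap_ops;
	pgmap->owner = dev;			/* opaque owner cookie */

	/* Creates ZONE_DEVICE struct pages for the VRAM range. */
	addr = devm_memremap_pages(dev, pgmap);
	return IS_ERR(addr) ? PTR_ERR(addr) : 0;
}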

Our test nodes in the lab are similar to the Frontier configuration,
with 0.5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.

We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
patch set builds on HMM and our SVM memory manager already merged in
5.15.

v2:
- test_hmm can now create private and coherent device mirror instances in
the same driver probe. This makes the hmm test easier to use, since the
kernel module no longer has to be reloaded for each device type under
test (private/coherent). This is done by passing the module parameters
spm_addr_dev0 & spm_addr_dev1. In this case, four instances of
device_mirror are created. The first two correspond to the private device
type, the last two to the coherent type. They can then be easily accessed
from user space through /dev/hmm_mirror<num_device>. Usually num_device 0
and 1 are for the private type, and 2 and 3 for the coherent type (see
the short user-space sketch after these v2 notes).

- Coherent device type pages are now migrated back to system memory at
gup if they have been long-term pinned (FOLL_LONGTERM). The reason is
that long-term pinned pages could otherwise interfere with the device's
own memory manager. A new hmm_gup_test has been added to hmm-test to
exercise this functionality. It makes use of the gup_test module to
long-term pin user pages that have first been migrated to device memory.

- Other patch corrections made by Felix, Alistair and Christoph.
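
As referenced in the first v2 note above, a tiny user-space sketch of
selecting one of the four mirror instances might look like the following.
This is purely illustrative and not part of the selftest changes in this
series; the device node naming follows the description above.

#include <fcntl.h>
#include <stdio.h>

/* num_device 0 and 1 are private-type mirrors, 2 and 3 are coherent-type. */
static int open_mirror(int num_device)
{
	char path[64];

	snprintf(path, sizeof(path), "/dev/hmm_mirror%d", num_device);
	return open(path, O_RDWR);
}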

v3:
- Based on Alistair's feedback on v2, we've decided to remove the
migration logic for FOLL_LONGTERM coherent device type pages at gup for
now. Ideally, this should be done by the kernel mm rather than by calling
into the device driver. Currently, there is no support for migrating
device pages based on pfn, mainly because migrate_pages() relies on pages
being LRU pages. Alistair mentioned he has started working on adding this
device page migration logic. For now, get_user_pages calls with
FOLL_LONGTERM fail for DEVICE_COHERENT pages (see the illustrative sketch
following these v3 notes).

- Also, hmm_gup_test has been removed from hmm-test. We plan to add it
back once this migration work is ready.

- Addressed Liam Howlett's feedback changes.
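
Following up on the first v3 note, the FOLL_LONGTERM behaviour amounts to
a check of roughly the following shape. This is a simplified sketch, not
the actual mm/gup.c hunk from this series, and the helper name
is_device_coherent_page() is assumed here for illustration.

/*
 * Simplified sketch: refuse long-term pins on DEVICE_COHERENT pages until
 * migrating device pages by pfn is supported.
 */
static inline bool gup_can_pin_longterm(struct page *page, unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_LONGTERM))
		return true;
	if (is_zone_device_page(page) && is_device_coherent_page(page))
		return false;
	return true;
}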

Alex Sierra (10):
  mm: add zone device coherent type memory support
  mm: add device coherent vma selection for memory migration
  mm/gup: fail get_user_pages for LONGTERM dev coherent type
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: coherent type as sys mem on migration to ram
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device coherent type in test_hmm
  tools: update hmm-test to support device coherent type
  tools: update test_hmm script to support SP config

 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  34 ++-
 include/linux/memremap.h                 |   8 +
 include/linux/migrate.h                  |   1 +
 include/linux/mm.h                       |  16 ++
 lib/test_hmm.c                           | 333 +++++++++++++++++------
 lib/test_hmm_uapi.h                      |  22 +-
 mm/gup.c                                 |   7 +
 mm/memcontrol.c                          |   6 +-
 mm/memory-failure.c                      |   8 +-
 mm/memremap.c                            |   5 +-
 mm/migrate.c                             |  30 +-
 tools/testing/selftests/vm/hmm-tests.c   | 122 +++++++--
 tools/testing/selftests/vm/test_hmm.sh   |  24 +-
 13 files changed, 475 insertions(+), 141 deletions(-)

Comments

Alistair Popple Jan. 12, 2022, 11:06 a.m. UTC | #1
I have been looking at this in relation to the migration code and noticed we
have the following in try_to_migrate():

        if (is_zone_device_page(page) && !is_device_private_page(page))
                return;

Which, if I'm understanding correctly, means that migration of device
coherent pages will always fail. Given that, I do wonder how the
hmm-tests are passing, but I assume you must always be hitting this fast
path in migrate_vma_collect_pmd():

                /*
                 * Optimize for the common case where page is only mapped once
                 * in one process. If we can lock the page, then we can safely
                 * set up a special migration page table entry now.
                 */

Meaning that try_to_migrate() never gets called from migrate_vma_unmap(). So
you will also need some changes to try_to_migrate() and possibly
try_to_migrate_one() to make this reliable.

 - Alistair

On Tuesday, 11 January 2022 9:31:51 AM AEDT Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> [remainder of the cover letter, shortlog and diffstat snipped; quoted in
> full in the original message above]
David Hildenbrand Jan. 12, 2022, 11:16 a.m. UTC | #2
On 10.01.22 23:31, Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> [remainder of the cover letter snipped; see the original message above]

Hi,

might be a dumb question because I'm not too familiar with
MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that
memory? Or how does a process ever get hold of such pages?

And where does migration come into play? I assume migration is only
required to migrate off of that device memory to ordinary system RAM
when required because the device memory has to be freed up, correct?

(a high-level description of how this is exploited from user space
would be great)
Felix Kuehling Jan. 12, 2022, 4:08 p.m. UTC | #3
On 2022-01-12 at 6:16 a.m., David Hildenbrand wrote:
> On 10.01.22 23:31, Alex Sierra wrote:
>> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
>> owned by a device that can be mapped into CPU page tables like
>> MEMORY_DEVICE_GENERIC and can also be migrated like
>> MEMORY_DEVICE_PRIVATE.
>> 
>> [remainder of the cover letter snipped; see the original message above]
> Hi,
>
> might be a dumb question because I'm not too familiar with
> MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that
> memory? Or how does a process ever get a grab on such pages?

Device memory management and migration to device memory work the same way
as with MEMORY_DEVICE_PRIVATE. The device driver is in charge of managing
the memory and of migrating data to it in response to application requests
(e.g. hipMemPrefetchAsync) or device page faults.

The nice thing about MEMORY_DEVICE_COHERENT is that the CPU, or a 3rd-party
device (e.g. a NIC), can access the memory without migrations disrupting
execution of high-performance application code on the GPU.


>
> And where does migration come into play? I assume migration is only
> required to migrate off of that device memory to ordinary system RAM
> when required because the device memory has to be freed up, correct?

That's one case. For example, memory pressure can force the GPU driver to
evict some device-coherent memory back to system memory. Applications can
also request a migration to system memory explicitly (again with something
like hipMemPrefetchAsync).
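
To make the user-space side concrete, a hypothetical application-level use
might look like the snippet below. This is illustrative only and not part
of this patch set; it assumes a HIP runtime on a system where ordinary
malloc'd memory can be migrated to coherent device memory.

#include <hip/hip_runtime.h>

void prefetch_example(float *buf, size_t bytes, int gpu_id)
{
	/* Ask the driver to migrate the pages to GPU device memory
	 * (backed by MEMORY_DEVICE_COHERENT pages on such a system). */
	hipMemPrefetchAsync(buf, bytes, gpu_id, 0);

	/* ... launch kernels that operate on buf ... */

	/* Explicitly migrate the pages back to ordinary system RAM. */
	hipMemPrefetchAsync(buf, bytes, hipCpuDeviceId, 0);
	hipDeviceSynchronize();
}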

Regards,
  Felix


>
> (a high level description on how this is exploited from users space
> would be great)
>
Alistair Popple Jan. 20, 2022, 6:33 a.m. UTC | #4
On Wednesday, 12 January 2022 10:06:03 PM AEDT Alistair Popple wrote:
> I have been looking at this in relation to the migration code and noticed we
> have the following in try_to_migrate():
> 
>         if (is_zone_device_page(page) && !is_device_private_page(page))
>                 return;
> 
> Which if I'm understanding correctly means that migration of device coherent
> pages will always fail. Given that I do wonder how hmm-tests are passing, but
> I assume you must always be hitting this fast path in
> migrate_vma_collect_pmd():
> 
>                 /*
>                  * Optimize for the common case where page is only mapped once
>                  * in one process. If we can lock the page, then we can safely
>                  * set up a special migration page table entry now.
>                  */
> 
> Meaning that try_to_migrate() never gets called from migrate_vma_unmap(). So
> you will also need some changes to try_to_migrate() and possibly
> try_to_migrate_one() to make this reliable.

I have been running the hmm tests with the changes below. I'm pretty sure
these are correct, because the only zone device pages try_to_migrate_one()
should be called on are device coherent/private, and coherent pages can be
treated just the same as normal pages for migration. However, it would be
worth checking that I haven't missed anything.

 - Alistair

---

diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..15f56c27daab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1806,7 +1806,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
 			unsigned long pfn = page_to_pfn(page);
 			swp_entry_t entry;
 			pte_t swp_pte;
@@ -1947,7 +1947,7 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
 					TTU_SYNC)))
 		return;
 
-	if (is_zone_device_page(page) && !is_device_private_page(page))
+	if (is_zone_device_page(page) && !is_device_page(page))
 		return;
 
 	/*