[v6,00/13] Support DEVICE_GENERIC memory in migrate_vma_*

Message ID 20210813063150.2938-1-alex.sierra@amd.com (mailing list archive)
Sierra Guiza, Alejandro (Alex) Aug. 13, 2021, 6:31 a.m. UTC
v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory allocations,
while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently upstreamed
to Dave Airlie's drm-next branch
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
On top of that we did some rework of our VRAM management for migrations to remove
some incorrect assumptions, allow partially successful migrations and GPU memory
mappings that mix pages in VRAM and system memory.
https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.com/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

v2:
This version of the series merges the "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series by
Ralph Campbell. Our changes to support the device generic type in the
migrate_vma helpers are applied on top of that series.
This has been tested on systems with device memory that the CPU can
access coherently.

It also addresses the following feedback on v1:
- Isolate the kernel/resource.c modification in its own patch, based
on Christoph's feedback.
- Add helpers that check for the generic and private types, to avoid
duplicated long lines.

v3:
- Include cover letter from v1.
- Rename the dax_layout_is_idle_page function to dax_page_unused in the patch
"ext4/xfs: add page refcount helper".

v4:
- Add support for the zone device generic type in lib/test_hmm and
tools/testing/selftests/vm/hmm-tests.
- Add missing page refcount helper to fuse/dax.c. This was included in
one of Ralph Campbell's patches.

v5:
- Cosmetic changes to patches 3, 5 and 13.
- A bug was found while running one of the xfstests (generic/413) used to
validate the fs_dax device type. It was introduced by the patch "mm: remove
extra ZONE_DEVICE struct page refcount", which is part of this patch series.
The bug showed up as a WARNING message at the try_grab_page function call, caused
by a page refcount equal to zero. One of the changes in "mm: remove extra
ZONE_DEVICE struct page refcount" is to initialize the page refcount to zero.
Therefore, in v5 a special condition was added to try_grab_page so that it also
checks for device zone pages. The fix is included in the same patch.

This is how the mm changes in this patch series have been validated:
- hmm-tests were run using the device private and device generic types. Support
for the latter was just added in this patch series. efi_fake_mem was used to
mimic SPM memory for device generic.
- The xfstests tool was used to validate the fs-dax device type and the page
refcount changes. A DAX configuration was used along with emulated persistent
memory set up as memmap=4G!4G memmap=4G!9G. xfstests were run from the ext4 and
generic lists. Some of them did not run due to limitations in the configuration,
e.g. tests not supporting a specific file system or DAX mode.
Only three tests failed: generic/356, generic/357 and ext4/049. However, these
failures were the same before and after applying this patch series.
xfstest configuration:
TEST_DEV=/dev/pmem0
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"
xfstest passed list:
Ext4:
001,003,005,021,022,023,025,026,030,031,032,036,037,038,042,043,044,271,306
Generic:
1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,20,21,22,23,24,25,28,29,30,31,32,33,35,37,
50,52,53,58,60,61,62,63,64,67,69,70,71,75,76,78,79,80,82,84,86,87,88,91,92,94,
96,97,98,99,103,105,112,113,114,117,120,124,126,129,130,131,135,141,169,184,
198,207,210,211,212,213,214,215,221,223,225,228,236,237,240,244,245,246,247,
248,249,255,257,258,263,277,286,294,306,307,308,309,313,315,316,318,319,337,
346,360,361,371,375,377,379,380,383,384,385,386,389,391,392,393,394,400,401,
403,404,406,409,410,411,412,413,417,420,422,423,424,425,426,427,428

v6:
- This patch series was rebased on amd-staging-drm-next, which in turn is
based on v5.13:
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next
- Handle null pointers in dmirror_allocate_chunk in test_hmm.c
- Here's a link to the repo including these patch series:
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/alexsierrag/device_generic

- CONFIGS required to run hmm-tests and xfstests with no special hardware.
For hmm-tests:
CONFIG_EFI_FAKE_MEMMAP=y
CONFIG_EFI_SOFT_RESERVE=y
CONFIG_TEST_HMM=m
CONFIG_RUNTIME_TESTING_MENU=y

For xfstests using emulated persistent memory:
CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_FS_DAX=y
CONFIG_DAX_DRIVER=y
CONFIG_VIRTIO_FS=y

HMM configs for both hmm-test and xfstest:
CONFIG_ZONE_DEVICE=y
CONFIG_HMM_MIRROR=y
CONFIG_MMU_NOTIFIER=y
CONFIG_DEVICE_PRIVATE=y 

- Kernel parameters to run hmm-tests and xfstests.
These tests require either emulated persistent memory (EPM) for xfstests or
fake special purpose memory (FSPM) for the hmm-tests device generic type
configuration. Both are carved out of system memory.
The idea is to reserve ranges of physical addresses by passing specific kernel
parameters. Make sure your kernel has been built with the CONFIGS mentioned
above. Once you reserve memory ranges through these two mechanisms, the kernel
cannot use them as regular system memory until these kernel parameters are
removed. Both mechanisms use similar parameters to define the physical address
and size. FSPM, however, uses a third field, which is the attribute value.
Here’s the syntax for both:
FSPM: efi_fake_mem=nn[KMG]@ss[KMG]:aa
EPM: memmap=nn[KMG]!ss[KMG]
'nn' defines the size of the reserved memory
'ss' is the physical/usable start address. This can be taken from the BIOS-e820 mem table
'aa' specifies the attribute. The SPM attribute is EFI_MEMORY_SP (0x40000)
[KMG] refers to kilo, mega, giga
To find an available memory region address, you can look at the BIOS-e820 mem
table, which is usually printed at kernel boot (dmesg). In this table, make sure
you choose ranges that are marked as 'usable' and are at least as large as your
desired reservation. For example, the range below has a size of about 13GB, out
of a total of 16GB of system memory.
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000044eafffff] usable
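
As a quick check of the range size quoted above, here is a minimal Python
sketch that parses a BIOS-e820 line and computes the size of the range. The
helper name e820_range_size is hypothetical and only used for illustration; the
sample line is the one shown above.

```python
import re

GIB = 1 << 30


def e820_range_size(line: str) -> int:
    """Return the size in bytes of the [mem start-end] range in a BIOS-e820 line."""
    m = re.search(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]", line)
    if m is None:
        raise ValueError("no [mem start-end] range found")
    start, end = int(m.group(1), 16), int(m.group(2), 16)
    return end - start + 1  # the end address in the table is inclusive


line = "[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000044eafffff] usable"
size = e820_range_size(line)
print(f"{size / GIB:.2f} GiB")  # a little over 13 GiB for this range
```

Any 'usable' range can be checked the same way before choosing start addresses
for the reservations.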

In our testing, we require two ranges of 4GB each for xfstests and two more of
1GB each for hmm-tests, for a total of 10GB.

Based on range above we set these two kernel parameters as follows:
EPM:
memmap=4G!4G memmap=4G!9G
FSPM:
efi_fake_mem=1G@0x200000000:0x40000,1G@0x340000000:0x40000
We alternate one EPM reservation (4GB) and one FSPM reservation (1GB), starting
at the 4GB address.
These kernel parameters can be passed by editing the grub file, "/etc/default/grub":
GRUB_CMDLINE_LINUX="memmap=4G!4G memmap=4G!9G
efi_fake_mem=1G@0x200000000:0x40000,1G@0x340000000:0x40000"
Once you have modified this file, don’t forget to update grub:
$sudo update-grub
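
The alternating layout described above can be reproduced with a short Python
sketch. It places the reservations back to back starting at 4GB and formats the
resulting memmap= and efi_fake_mem= parameters. The function names build_layout
and kernel_params are hypothetical, and the sketch assumes that all sizes and
start addresses are multiples of 1GB.

```python
GIB = 1 << 30
EFI_MEMORY_SP = 0x40000  # SPM attribute value given in the text


def build_layout(start, kinds_and_sizes):
    """Place reservations back to back; return (kind, start, size) tuples."""
    layout, addr = [], start
    for kind, size in kinds_and_sizes:
        layout.append((kind, addr, size))
        addr += size
    return layout


def kernel_params(layout):
    """Format memmap= (EPM) and efi_fake_mem= (FSPM) parameters.

    Assumes every size and start address is a multiple of 1 GiB.
    """
    epm = [f"memmap={size // GIB}G!{addr // GIB}G"
           for kind, addr, size in layout if kind == "EPM"]
    fspm = [f"{size // GIB}G@{addr:#x}:{EFI_MEMORY_SP:#x}"
            for kind, addr, size in layout if kind == "FSPM"]
    return " ".join(epm + ["efi_fake_mem=" + ",".join(fspm)])


# One EPM (4GB) and one FSPM (1GB) reservation, twice, starting at 4GB.
layout = build_layout(4 * GIB, [("EPM", 4 * GIB), ("FSPM", 1 * GIB),
                                ("EPM", 4 * GIB), ("FSPM", 1 * GIB)])
params = kernel_params(layout)
print(params)
```

With these inputs the sketch yields the same parameter values as the ones
listed above. The last reservation ends at 14GB, which is inside the usable
range taken from the BIOS-e820 table.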

After booting with these parameters applied, you should see the new ranges
defined at the "extended physical RAM map" table. This is printed at boot:
reserve setup_data: [mem 0x0000000100000000-0x00000001ffffffff] persistent (type 12)
reserve setup_data: [mem 0x0000000200000000-0x000000023fffffff] soft reserved
reserve setup_data: [mem 0x0000000240000000-0x000000033fffffff] persistent (type 12)
reserve setup_data: [mem 0x0000000340000000-0x000000037fffffff] soft reserved

As you see, EPM ranges are now labeled as Persistent (type 12) and FSPM ranges
as soft reserved. 
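
To double-check the result after booting, the ranges can be pulled out of the
boot log with a small Python sketch such as the one below. It is only an
illustration: the helper name ranges_by_label is hypothetical, and the log text
is the sample from this message rather than live dmesg output.

```python
import re

# Sample lines copied from the "extended physical RAM map" shown above.
BOOT_LOG = """\
reserve setup_data: [mem 0x0000000100000000-0x00000001ffffffff] persistent (type 12)
reserve setup_data: [mem 0x0000000200000000-0x000000023fffffff] soft reserved
reserve setup_data: [mem 0x0000000240000000-0x000000033fffffff] persistent (type 12)
reserve setup_data: [mem 0x0000000340000000-0x000000037fffffff] soft reserved
"""

RANGE_RE = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]\s+(.+)$")


def ranges_by_label(log, label):
    """Return (start, size) for every range whose label starts with `label`."""
    found = []
    for line in log.splitlines():
        m = RANGE_RE.search(line)
        if m and m.group(3).startswith(label):
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            found.append((start, end - start + 1))  # end address is inclusive
    return found


soft = ranges_by_label(BOOT_LOG, "soft reserved")
sp_args = " ".join(f"{start:#x}" for start, _ in soft)
print(sp_args)
```

The start addresses of the "soft reserved" ranges are the values that identify
the two SP regions for the device generic hmm-tests.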

- Setting up and running hmm-tests
These tests can now be run with either the device private or the device generic
type. The latter requires setting up Special Purpose Memory as described above.
To run them manually, go to the following directory in your kernel source tree:
$cd tools/testing/selftests/vm/

To run device private, enter:
$sudo ./test_hmm.sh smoke

To run device generic, you must pass the physical start addresses of both SP
regions. In this example, these can be taken from the ranges labeled
"soft reserved" in the table above:
$sudo ./test_hmm.sh smoke 0x200000000 0x340000000

The same hmm-tests are executed for both device types.

- Setting up and running xfstests
Clone the xfstests-dev repo:
$git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
$cd xfstests-dev
$make
$sudo make install

In the xfstests-dev directory, create a local.config file with the following
information:
TEST_DEV=/dev/pmem0
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"

Create mounting directories:
$sudo mkdir /mnt/ram0
$sudo mkdir /mnt/ram1

Every time you boot, you need to create an ext4 file system on each of the
emulated persistent memory partitions.
$sudo mkfs.ext4 /dev/pmem0
$sudo mkfs.ext4 /dev/pmem1

To run the tests:
$sudo ./check -g quick

Alex Sierra (11):
  kernel: resource: lookup_resource as exported symbol
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helpers to check zone device generic type
  mm: add generic type support to migrate_vma helpers
  mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device generic type in test_hmm
  tools: update hmm-test to support device generic type
  tools: update test_hmm script to support SP config

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  22 ++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   2 +-
 fs/dax.c                                 |   8 +-
 fs/ext4/inode.c                          |   5 +-
 fs/fuse/dax.c                            |   4 +-
 fs/xfs/xfs_file.c                        |   4 +-
 include/linux/dax.h                      |  10 +
 include/linux/memremap.h                 |   7 +-
 include/linux/mm.h                       |  21 +-
 kernel/resource.c                        |   1 +
 lib/test_hmm.c                           | 237 +++++++++++++++--------
 lib/test_hmm_uapi.h                      |  16 ++
 mm/internal.h                            |   8 +
 mm/memremap.c                            |  69 ++-----
 mm/migrate.c                             |  23 ++-
 mm/page_alloc.c                          |   3 +
 mm/swap.c                                |  45 +----
 tools/testing/selftests/vm/hmm-tests.c   | 142 ++++++++++++--
 tools/testing/selftests/vm/test_hmm.sh   |  20 +-
 20 files changed, 411 insertions(+), 238 deletions(-)