
[06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Message ID 20240117221223.18540-7-oak.zeng@intel.com (mailing list archive)
State New, archived
Series XeKmd basic SVM support

Commit Message

Oak Zeng Jan. 17, 2024, 10:12 p.m. UTC
Introduce the xe_svm_build_sg helper function to build a scatter
gather table from a hmm_range struct. This is preparation work
for binding an hmm range to the GPU.

Signed-off-by: Oak Zeng <oak.zeng@intel.com>
Co-developed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Brian Welty <brian.welty@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c | 52 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_svm.h |  3 +++
 2 files changed, 55 insertions(+)

Comments

Jason Gunthorpe April 5, 2024, 12:39 a.m. UTC | #1
On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> +/**
> + * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
> + * in a hmm_range.
> + *
> + * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
> + * has the pfn numbers of pages that back up this hmm address range.
> + * @st: pointer to the sg table.
> + *
> + * All the contiguous pfns will be collapsed into one entry in
> + * the scatter gather table. This is for the convenience of
> + * later on operations to bind address range to GPU page table.
> + *
> + * This function allocates the storage of the sg table. It is
> + * caller's responsibility to free it calling sg_free_table.
> + *
> + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> + */
> +int xe_svm_build_sg(struct hmm_range *range,
> +			     struct sg_table *st)
> +{
> +	struct scatterlist *sg;
> +	u64 i, npages;
> +
> +	sg = NULL;
> +	st->nents = 0;
> +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> +
> +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> +		return -ENOMEM;
> +
> +	for (i = 0; i < npages; i++) {
> +		unsigned long addr = range->hmm_pfns[i];
> +
> +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> +			sg->length += PAGE_SIZE;
> +			sg_dma_len(sg) += PAGE_SIZE;
> +			continue;
> +		}
> +
> +		sg =  sg ? sg_next(sg) : st->sgl;
> +		sg_dma_address(sg) = addr;
> +		sg_dma_len(sg) = PAGE_SIZE;
> +		sg->length = PAGE_SIZE;
> +		st->nents++;
> +	}
> +
> +	sg_mark_end(sg);
> +	return 0;
> +}

I didn't look at this series a lot but I wanted to make a few
remarks. This I really don't like. Yes, the DMA API interaction
with hmm_range_fault is pretty bad, but it should not be hacked
around like this. Leon is working on a series to improve it:

https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/

Please participate there too. In the meantime you should just call
dma_map_page for every single page like ODP does.
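
(For illustration, a minimal per-page mapping loop in that spirit might look
roughly like the sketch below; the device pointer, output array and helper
name are assumptions made for the sketch, not code from this series.)

/* Sketch: map the pages hmm_range_fault() returned one by one,
 * similar to what ODP does, instead of building an sg table.
 */
static int map_hmm_pages(struct device *dev, struct hmm_range *range,
			 dma_addr_t *dma_addrs, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page;

		if (!(range->hmm_pfns[i] & HMM_PFN_VALID))
			return -EFAULT;

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;
	}

	return 0;
}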

Also, I tried to follow the large discussion in the end but it was
quite hard to read the text in Lore for some reason.

I would just opine some general points on how I see hmm_range_fault
being used by drivers.

First of all, the device should have a private page table. At least
one, but ideally many. Obviously it should work, so I found it a bit
puzzling the talk about problems with virtualization. Either the
private page table works virtualized, including faults, or it should
not be available..

Second, I see hmm_range_fault as having two main design patterns of
interaction. Either it is the entire exclusive owner of a single
device private page table and fully mirrors the mm page table into the
device table.

Or it is a selective mirror where it copies part of the mm page table
into a "vma" of a possibly shared device page table. The
hmm_range_fault bit would exclusively own its bit of VMA.

So I find it quite strange that this RFC is creating VMAs,
notifiers and ranges on the fly. That should happen when userspace
indicates it wants some/all of the MM to be bound to a specific device
private pagetable/address space.

I also agree with the general spirit of the remarks that there should
not be a process binding or any kind of "global" character
device. Device private page tables are by their very nature private to
the device and should be created through a device specific char
dev. If you have a VMA concept for these page tables then a HMM
mirroring one is simply a different type of VMA along with all the
others.

I was also looking at the mmu notifier register/unregister with some
suspicion, it seems wrong to have a patch talking about "process
exit". Notifiers should be destroyed when the device private page
table is destroyed, or the VMA is destroyed. Attempting to connect it
to a process beyond tying the lifetime of a page table to a FD is
nonsensical.

Jason
Oak Zeng April 5, 2024, 3:33 a.m. UTC | #2
Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 4, 2024 8:39 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Wed, Jan 17, 2024 at 05:12:06PM -0500, Oak Zeng wrote:
> > +/**
> > + * xe_svm_build_sg() - build a scatter gather table for all the physical
> pages/pfn
> > + * in a hmm_range.
> > + *
> > + * @range: the hmm range that we build the sg table from. range-
> >hmm_pfns[]
> > + * has the pfn numbers of pages that back up this hmm address range.
> > + * @st: pointer to the sg table.
> > + *
> > + * All the contiguous pfns will be collapsed into one entry in
> > + * the scatter gather table. This is for the convenience of
> > + * later on operations to bind address range to GPU page table.
> > + *
> > + * This function allocates the storage of the sg table. It is
> > + * caller's responsibility to free it calling sg_free_table.
> > + *
> > + * Returns 0 if successful; -ENOMEM if fails to allocate memory
> > + */
> > +int xe_svm_build_sg(struct hmm_range *range,
> > +			     struct sg_table *st)
> > +{
> > +	struct scatterlist *sg;
> > +	u64 i, npages;
> > +
> > +	sg = NULL;
> > +	st->nents = 0;
> > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >>
> PAGE_SHIFT) + 1;
> > +
> > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < npages; i++) {
> > +		unsigned long addr = range->hmm_pfns[i];
> > +
> > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > +			sg->length += PAGE_SIZE;
> > +			sg_dma_len(sg) += PAGE_SIZE;
> > +			continue;
> > +		}
> > +
> > +		sg =  sg ? sg_next(sg) : st->sgl;
> > +		sg_dma_address(sg) = addr;
> > +		sg_dma_len(sg) = PAGE_SIZE;
> > +		sg->length = PAGE_SIZE;
> > +		st->nents++;
> > +	}
> > +
> > +	sg_mark_end(sg);
> > +	return 0;
> > +}
> 
> I didn't look at this series a lot but I wanted to make a few
> remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> with hmm_range_fault is pretty bad, but it should not be hacked
> around like this. Leon is working on a series to improve it:
> 
> https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/


I completely agree the above code is really ugly. We definitely want to adopt a better approach. I will spend some time on the above series.

> 
> Please participate there too. In the mean time you should just call
> dma_map_page for every single page like ODP does.

The above code deals with a case where a dma map is not needed. As I understand it, whether we need a dma map depends on the device topology. For example, when a device accesses host memory or another device's memory through PCIe, we need dma mapping; if the connection between devices is xelink (similar to Nvidia's nvlink), all devices' memory can be in the same address space, so no dma mapping is needed.


> 
> Also, I tried to follow the large discussion in the end but it was
> quite hard to read the text in Lore for some reason.

Did you mean this discussion: https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works well for me in the Chrome browser.


> 
> I would just opine some general points on how I see hmm_range_fault
> being used by drivers.
> 
> First of all, the device should have a private page table. At least
> one, but ideally many. Obviously it should work, so I found it a bit
> puzzling the talk about problems with virtualization. Either the
> private page table works virtualized, including faults, or it should
> not be available..

To be very honest, I was also very confused. In this series, I had one very fundamental assumption that with hmm any valid cpu virtual address is also a valid gpu virtual address. But Christian had a very strong opinion that the gpu va can have an offset from the cpu va. He mentioned a failed use case with amdkfd and claimed an offset could solve their problem.

For all our known use cases, gpu va == cpu va. But we have agreed to make the uAPI flexible so we can introduce an offset if such a use case comes up in the future.

> 
> Second, I see hmm_range_fault as having two main design patterns
> interactions. Either it is the entire exclusive owner of a single
> device private page table and fully mirrors the mm page table into the
> device table.
> 
> Or it is a selective mirror where it copies part of the mm page table
> into a "vma" of a possibly shared device page table. The
> hmm_range_fault bit would exclusively own it's bit of VMA.

Can you explain what the "hmm_range_fault bit" is?


The whole page table (process address space) is mirrored. But we don't have to copy the whole CPU page table to the device page table. We only need to copy the page table entries of the address ranges that are accessed by the GPU. For address ranges that are not accessed by the GPU, there is no need to set up a GPU page table.

> 
> So I find it a quite strange that this RFC is creating VMA's,
> notifiers and ranges on the fly. That should happen when userspace
> indicates it wants some/all of the MM to be bound to a specific device
> private pagetable/address space.

We register notifiers on the fly because the GPU doesn't access all of the valid CPU virtual addresses. The GPU only accesses a subset of the valid CPU addresses.

Do you think registering one huge mmu notifier to cover the whole address space would be better? I don't know; there would be many more unnecessary callbacks from the mmu to the device driver.

Similarly, we create ranges on the fly only for those ranges that are accessed by the gpu. But we have an idea to keep one gigantic range/VMA representing the whole address space while creating only some "gpu page table state" ranges on the fly. This idea requires some refactoring of our xe driver, and we will evaluate it more to decide whether we want to go this way.


> 
> I also agree with the general spirit of the remarks that there should
> not be a process binding or any kind of "global" character
> device. 

Even though a global pseudo device looks bad, it does come with some benefit. For example, if you want to set memory attributes on a virtual address range shared between all devices, you can set such attributes through an ioctl of the global device. We have agreed to remove our global character device, and we will repeat the memory attribute setting on each device one by one.

Is /dev/nvidia-uvm a global character device for uvm purpose?

> Device private page tables are by their very nature private to
> the device and should be created through a device specific char
> dev. If you have a VMA concept for these page tables then a HMM
> mirroring one is simply a different type of VMA along with all the
> others.
> 
> I was also looking at the mmu notifier register/unregister with some
> suspicion, it seems wrong to have a patch talking about "process
> exit". Notifiers should be destroyed when the device private page
> table is destroyed, or the VMA is destroyed. 

Right. I have dropped the concept of process exit. I will soon send out the new series for review.

Oak

> Attempting to connect it
> to a process beyond tying the lifetime of a page table to a FD is
> nonsensical.
> 
> Jason
Jason Gunthorpe April 5, 2024, 12:37 p.m. UTC | #3
On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > 
> > I didn't look at this series a lot but I wanted to make a few
> > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > with hmm_range_fault is pretty bad, but it should not be hacked
> > around like this. Leon is working on a series to improve it:
> > 
> > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> 
> 
> I completely agree above codes are really ugly. We definitely want
> to adapt to a better way. I will spend some time on above series.
> 
> > 
> > Please participate there too. In the mean time you should just call
> > dma_map_page for every single page like ODP does.
> 
> Above codes deal with a case where dma map is not needed. As I
> understand it, whether we need a dma map depends on the devices
> topology. For example, when device access host memory or another
> device's memory through pcie, we need dma mapping; if the connection
> b/t devices is xelink (similar to nvidia's nvlink), all device's
> memory can be in same address space, so no dma mapping is needed.

Then you call dma_map_page to do your DMA side and you avoid it for
the DEVICE_PRIVATE side. SG list doesn't help this anyhow.

> > Also, I tried to follow the large discussion in the end but it was
> > quite hard to read the text in Lore for some reason.
> 
> Did you mean this discussion: https://lore.kernel.org/dri-devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good for me with chrome browser.

That is the one I am referring to

> > I would just opine some general points on how I see hmm_range_fault
> > being used by drivers.
> > 
> > First of all, the device should have a private page table. At least
> > one, but ideally many. Obviously it should work, so I found it a bit
> > puzzling the talk about problems with virtualization. Either the
> > private page table works virtualized, including faults, or it should
> > not be available..
>
> To be very honest, I was also very confused. In this series, I had
> one very fundamental assumption that with hmm any valid cpu virtual
> address is also a valid gpu virtual address. But Christian had a
> very strong opinion that the gpu va can have an offset to cpu va. He
> mentioned a failed use case with amdkfd and claimed an offset can
> solve their problem.

Offset is something different; I said the VM's view of the page table
should fully work. You shouldn't get into a weird situation where the
VM is populating the page table and can't handle faults or something.

If the VMM has a weird design where there is only one page table and
it needs to partition space based on slicing it into regions then
fine, but the delegated region to the guest OS should still work
fully.

> > Or it is a selective mirror where it copies part of the mm page table
> > into a "vma" of a possibly shared device page table. The
> > hmm_range_fault bit would exclusively own it's bit of VMA.
> 
> Can you explain what is "hmm_range_fault bit"?

I mean if you set up a mirror VMA in a device private page table then that
range of VA will be owned by the hmm_range_fault code and will mirror
a subset of a mm into that VMA. This is the offset you mention
above. The MM's VA and the device private page table VA do not have to
be the same (eg we implement this option in RDMA's ODP)

A 1:1 SVA mapping is a special case of this where there is a single
GPU VMA that spans the entire process address space with a 1:1 VA (no
offset).

> Do you think register a huge mmu notifier to cover the whole address
> space would be good? I don't know here but there would be much more
> unnecessary callbacks from mmu to device driver.

Yes. IMHO you should try to optimize the invalidations away in driver
logic not through dynamic mmu notifiers. Installing and removing a
notifier is very expensive.

> Similarly, we create range only the fly for those range that is
> accessed by gpu. But we have some idea to keep one gigantic
> range/VMA representing the whole address space while creating only
> some "gpu page table state range" on the fly. This idea requires
> some refactor to our xe driver and we will evaluate it more to
> decide whether we want to go this way.

This is a better direction.
 
> > I also agree with the general spirit of the remarks that there should
> > not be a process binding or any kind of "global" character
> > device. 
> 
> Even though a global pseudo device looks bad, it does come with some
> benefit. For example, if you want to set a memory attributes to a
> shared virtual address range b/t all devices, you can set such
> attributes through a ioctl of the global device. We have agreed to
> remove our global character device and we will repeat the memory
> attributes setting on all devices one by one.

That implies you have a global shared device private page table which
is sort of impossible because of how the DMA API works.

Having the kernel iterate over all the private page tables vs having
the userspace iterate over all the private page tables doesn't seem
like a worthwhile difference to justify a global cdev.

> Is /dev/nvidia-uvm a global character device for uvm purpose?

No idea, I wouldn't assume anything the nvidia drivers do is aligned
with what upstream expects.

Jason
Oak Zeng April 5, 2024, 4:42 p.m. UTC | #4
> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Jason
> Gunthorpe
> Sent: Friday, April 5, 2024 8:37 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Fri, Apr 05, 2024 at 03:33:10AM +0000, Zeng, Oak wrote:
> > >
> > > I didn't look at this series a lot but I wanted to make a few
> > > remarks.. This I don't like quite a lot. Yes, the DMA API interaction
> > > with hmm_range_fault is pretty bad, but it should not be hacked
> > > around like this. Leon is working on a series to improve it:
> > >
> > > https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> >
> >
> > I completely agree above codes are really ugly. We definitely want
> > to adapt to a better way. I will spend some time on above series.
> >
> > >
> > > Please participate there too. In the mean time you should just call
> > > dma_map_page for every single page like ODP does.
> >
> > Above codes deal with a case where dma map is not needed. As I
> > understand it, whether we need a dma map depends on the devices
> > topology. For example, when device access host memory or another
> > device's memory through pcie, we need dma mapping; if the connection
> > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > memory can be in same address space, so no dma mapping is needed.
> 
> Then you call dma_map_page to do your DMA side and you avoid it for
> the DEVICE_PRIVATE side. SG list doesn't help this anyhow.

When a dma map is needed, we use dma_map_sgtable, a different flavor of the dma_map_page function.

The reason we also used (mis-used) the sg list for the non-dma-map cases is that we want to re-use some data structures. Our goal here is, for a hmm_range, to build a list of device physical addresses (which may or may not be dma-mapped) that will be used later on to program the device page table. We re-used the sg list structure for the non-dma-map cases so those two cases can share the same page table programming code. Since the sg list was originally designed for dma-map, it does look like it is mis-used here.
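
(For illustration only: building a flat array of addresses instead of an sg_table could look roughly like the sketch below. is_device_private_page() is the core-mm test; xe_devm_page_to_dpa() is a hypothetical helper name for translating a device-private page to its device physical address.)

/* Sketch: one flat address array feeding the GPU page table update,
 * instead of an sg_table.
 */
for (i = 0; i < npages; i++) {
	struct page *page = hmm_pfn_to_page(range->hmm_pfns[i]);

	if (is_device_private_page(page))
		/* local VRAM: device physical address, no dma map */
		addrs[i] = xe_devm_page_to_dpa(page);
	else
		addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					DMA_BIDIRECTIONAL);
}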

I need to mention that even for some DEVICE_PRIVATE memory, we also need dma-mapping. For example, if you have two devices connected to each other through PCIe, both devices' memory is registered as DEVICE_PRIVATE with hmm. I believe we need a dma-map when one device accesses another device's memory; the two devices' memory doesn't belong to the same address space in this case. For modern GPUs with xeLink/nvLink/XGMI, this is not needed.


> 
> > > Also, I tried to follow the large discussion in the end but it was
> > > quite hard to read the text in Lore for some reason.
> >
> > Did you mean this discussion: https://lore.kernel.org/dri-
> devel/?q=Making+drm_gpuvm+work+across+gpu+devices? This link works good
> for me with chrome browser.
> 
> That is the one I am referring to
> 
> > > I would just opine some general points on how I see hmm_range_fault
> > > being used by drivers.
> > >
> > > First of all, the device should have a private page table. At least
> > > one, but ideally many. Obviously it should work, so I found it a bit
> > > puzzling the talk about problems with virtualization. Either the
> > > private page table works virtualized, including faults, or it should
> > > not be available..
> >
> > To be very honest, I was also very confused. In this series, I had
> > one very fundamental assumption that with hmm any valid cpu virtual
> > address is also a valid gpu virtual address. But Christian had a
> > very strong opinion that the gpu va can have an offset to cpu va. He
> > mentioned a failed use case with amdkfd and claimed an offset can
> > solve their problem.
> 
> Offset is something different, I said the VM's view of the page table
> should fully work. You shouldn't get into a weird situation where the
> VM is populating the page table and can't handle faults or something.
> 

We don't have such a weird situation. There are two layers of translation when running under a virtualized environment. From the guest VM's perspective, the first-level page table is in the guest device physical address space. It is no different from the bare-metal situation. Our driver doesn't need to know whether it runs virtualized or bare-metal for first-level page table programming and page fault handling.

> If the VMM has a weird design where there is only one page table and
> it needs to partition space based on slicing it into regions then
> fine, but the delegated region to the guest OS should still work
> fully.

Agree.

> 
> > > Or it is a selective mirror where it copies part of the mm page table
> > > into a "vma" of a possibly shared device page table. The
> > > hmm_range_fault bit would exclusively own it's bit of VMA.
> >
> > Can you explain what is "hmm_range_fault bit"?
> 
> I mean if you setup a mirror VMA in a device private page table then that
> range of VA will be owned by the hmm_range_fault code and will mirror
> a subset of a mm into that VMA. This is the offset you mention
> above. The MM's VA and the device private page table VA do not have to
> be the same (eg we implement this option in RDMA's ODP)

Ok, it is great to hear of a use case where cpu va != gpu va.

> 
> A 1:1 SVA mapping is a special case of this where there is a single
> GPU VMA that spans the entire process address space with a 1:1 VA (no
> offset).

From an implementation perspective, we can have one device page table per process for such a 1:1 va mapping, but it is not necessary to have a single gpu vma. We can have many gpu vmas, each covering a segment of this address space. We had this structure before this svm/system allocator work. This is also true for core mm vmas. But as said, we are also open to exploring a single gigantic gpu vma covering the whole address space.

> 
> > Do you think register a huge mmu notifier to cover the whole address
> > space would be good? I don't know here but there would be much more
> > unnecessary callbacks from mmu to device driver.
> 
> Yes. IMHO you should try to optimize the invalidations away in driver
> logic not through dynamic mmu notifiers. Installing and removing a
> notifier is very expensive.

Ok, we will compare the performance of those two methods.

> 
> > Similarly, we create range only the fly for those range that is
> > accessed by gpu. But we have some idea to keep one gigantic
> > range/VMA representing the whole address space while creating only
> > some "gpu page table state range" on the fly. This idea requires
> > some refactor to our xe driver and we will evaluate it more to
> > decide whether we want to go this way.
> 
> This is a better direction.
> 
> > > I also agree with the general spirit of the remarks that there should
> > > not be a process binding or any kind of "global" character
> > > device.
> >
> > Even though a global pseudo device looks bad, it does come with some
> > benefit. For example, if you want to set a memory attributes to a
> > shared virtual address range b/t all devices, you can set such
> > attributes through a ioctl of the global device. We have agreed to
> > remove our global character device and we will repeat the memory
> > attributes setting on all devices one by one.
> 
> That implies you have a global shared device private page table which
> is sort of impossible because of how the DMA API works.

In the case of multiple devices on one machine, we don't have a shared global device page table. Each device can still have its own page table, even though the page tables of all devices mirror the same address space (i.e., the same virtual address points to the same physical location in the cpu page table and in all the device page tables).

The reason we had the global device is mainly for setting vma memory attributes. For example, for one vma, the user can set the preferred backing store placement to be on a specific device. Without a global pseudo device, we would have to perform such settings on each device through that device's ioctl, and we would not be able to tell device 0 that the preferred placement is on device 1, because device 0 doesn't even know of the existence of device 1...

Anyway, let's continue the multiple-devices discussion under the thread "Cross-device and cross-driver HMM support"

Oak

> 
> Having the kernel iterate over all the private page tables vs having
> the userspace iterate over all the private page tables doesn't seem
> like a worthwile difference to justify a global cdev.
> 
> > Is /dev/nvidia-uvm a global character device for uvm purpose?
> 
> No idea, I wouldn't assume anything the nvidia drivers do is aligned
> with what upstream expects.
> 
> Jason
Jason Gunthorpe April 5, 2024, 6:02 p.m. UTC | #5
On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > Above codes deal with a case where dma map is not needed. As I
> > > understand it, whether we need a dma map depends on the devices
> > > topology. For example, when device access host memory or another
> > > device's memory through pcie, we need dma mapping; if the connection
> > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > memory can be in same address space, so no dma mapping is needed.
> > 
> > Then you call dma_map_page to do your DMA side and you avoid it for
> > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> 
> When dma map is needed, we used dma_map_sgtable, a different flavor
> of the dma-map-page function.

I saw, I am saying this should not be done. You cannot unmap bits of
a sgl mapping if an invalidation comes in.

> The reason we also used (mis-used) sg list for non-dma-map cases, is
> because we want to re-use some data structure. Our goal here is, for
> a hmm_range, build a list of device physical address (can be
> dma-mapped or not), which will be used later on to program the
> device page table. We re-used the sg list structure for the
> non-dma-map cases so those two cases can share the same page table
> programming codes. Since sg list was originally designed for
> dma-map, it does look like this is mis-used here.

Please don't use sg list at all for this.
 
> Need to mention, even for some DEVICE_PRIVATE memory, we also need
> dma-mapping. For example, if you have two devices connected to each
> other through PCIe, both devices memory are registered as
> DEVICE_PRIVATE to hmm. 

Yes, but you don't ever dma map DEVICE_PRIVATE.

> I believe we need a dma-map when one device access another device's
> memory. Two devices' memory doesn't belong to same address space in
> this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> needed.

Review my emails here:

https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/

Which explain how it should work.

> > A 1:1 SVA mapping is a special case of this where there is a single
> > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > offset).
> 
> From implementation perspective, we can have one device page table
> for one process for such 1:1 va mapping, but it is not necessary to
> have a single gpu vma. We can have many gpu vma each cover a segment
> of this address space. 

This is not what I'm talking about. The GPU VMA is bound to a specific
MM VA, it should not be created on demand.

If you want the full 1:1 SVA case to optimize invalidations you don't
need something like a VMA; a simple bitmap reducing the address space
into 1024 faulted-in chunks or something would be much cheaper than
some dynamic VMA ranges.
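
(Purely to illustrate the bitmap idea; the names, struct and chunk size below
are made up for the sketch, not taken from any driver.)

#define CHUNK_SHIFT	21		/* e.g. 2M chunks */
#define NR_CHUNKS	1024

struct mirror_state {
	u64 va_base;			/* start of the mirrored VA span */
	DECLARE_BITMAP(faulted, NR_CHUNKS);
};

/* Sketch: mark/test which fixed-size chunks have been faulted in,
 * instead of tracking dynamic VMA ranges.
 */
static void chunk_mark_faulted(struct mirror_state *m, u64 addr)
{
	set_bit((addr - m->va_base) >> CHUNK_SHIFT, m->faulted);
}

static bool chunk_is_faulted(struct mirror_state *m, u64 addr)
{
	return test_bit((addr - m->va_base) >> CHUNK_SHIFT, m->faulted);
}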

Jason
Oak Zeng April 9, 2024, 4:45 p.m. UTC | #6
Hi Jason

We are re-spinning this series based on the previous community feedback. I will send out a v2 soon. There are big changes compared to v1, so it would be better to discuss this work against v2.

See some replies inline.

> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, April 5, 2024 2:02 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > > Above codes deal with a case where dma map is not needed. As I
> > > > understand it, whether we need a dma map depends on the devices
> > > > topology. For example, when device access host memory or another
> > > > device's memory through pcie, we need dma mapping; if the connection
> > > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > > memory can be in same address space, so no dma mapping is needed.
> > >
> > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> >
> > When dma map is needed, we used dma_map_sgtable, a different flavor
> > of the dma-map-page function.
> 
> I saw, I am saying this should not be done. You cannot unmap bits of
> a sgl mapping if an invalidation comes in.

You are right: if we register a huge mmu interval notifier to cover the whole address space, then we should use dma map/unmap of individual pages to map bits of the address space. We will explore this approach.

Right now, in the xe driver, mmu interval notifiers are dynamically registered for small address ranges. We map/unmap the whole small address range each time, so functionally it still works, but it might not be as performant as the method you described. This is existing logic for our userptr code. Our system allocator inherits this logic automatically, as our system allocator design is built on top of userptr (will send out v2 soon). We plan to make things work in the first stage and then do some performance improvements like you suggested here.

> 
> > The reason we also used (mis-used) sg list for non-dma-map cases, is
> > because we want to re-use some data structure. Our goal here is, for
> > a hmm_range, build a list of device physical address (can be
> > dma-mapped or not), which will be used later on to program the
> > device page table. We re-used the sg list structure for the
> > non-dma-map cases so those two cases can share the same page table
> > programming codes. Since sg list was originally designed for
> > dma-map, it does look like this is mis-used here.
> 
> Please don't use sg list at all for this.

As explained, we use the sg list for device private pages so we can re-use the gpu page table update code. The input to the gpu page table update code in this case is a list of dma addresses (in the case of system memory) or device physical addresses (in the case of device private pages). The gpu page table update code in the xe driver is pretty complicated, so re-using that code is preferable for us. If we introduce a different data structure, we would have to re-write part of the gpu page table update code.

I don't see an obvious problem with this approach, but if you see one, I am open to changing it.

> 
> > Need to mention, even for some DEVICE_PRIVATE memory, we also need
> > dma-mapping. For example, if you have two devices connected to each
> > other through PCIe, both devices memory are registered as
> > DEVICE_PRIVATE to hmm.
> 
> Yes, but you don't ever dma map DEVICE_PRIVATE.
> 
> > I believe we need a dma-map when one device access another device's
> > memory. Two devices' memory doesn't belong to same address space in
> > this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> > needed.
> 
> Review my emails here:
> 
> https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/
> 
> Which explain how it should work.

You are right. A dma map is not needed for device private memory.

> 
> > > A 1:1 SVA mapping is a special case of this where there is a single
> > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > offset).
> >
> > From implementation perspective, we can have one device page table
> > for one process for such 1:1 va mapping, but it is not necessary to
> > have a single gpu vma. We can have many gpu vma each cover a segment
> > of this address space.
> 
> This is not what I'm talking about. The GPU VMA is bound to a specific
> MM VA, it should not be created on demand.

Today we have two places where we create a gpu vma: 1) during a vm_bind ioctl; 2) during a page fault on a system allocator range (this will be in v2 of this series).

So for case 2), we do create gpu vmas on demand. We also plan to explore doing this differently, such as keeping only a gigantic gpu vma covering the whole address space for the system allocator while creating only some gpu page table state during page fault handling. This is planned for stage 2.

> 
> If you want the full 1:1 SVA case to optimize invalidations you don't
> need something like a VMA, 

A gpu vma (the xe_vma struct in the xe driver code) is a very fundamental concept in our driver. Leveraging the vma lets us re-use a lot of existing driver code, such as the page table update.

But you are right, strictly speaking we don't need a vma. Actually, in this v1 version I sent out, we don't have a gpu vma concept for the system allocator, but we do have an svm range concept. We created a temporary vma for page table update purposes. Again, this design will be obsolete in v2 - in v2 the system allocator leverages the userptr code, which is integrated with the gpu vma.


> a simple bitmap reducing the address space
> into 1024 faulted in chunks or something would be much cheaper than
> some dynamic VMA ranges.


I suspect something dynamic is still necessary, either a vma or page table state (1 vma but many page table states created dynamically, as planned in our stage 2). The reason is that we still need some corresponding gpu-side structure to match the cpu vm_area_struct. For example, when a gpu page fault happens, you look up the cpu vm_area_struct for the fault address and create a corresponding state/struct. And people can have as many cpu vm_area_structs as they want.

Oak  

> 
> Jason
Jason Gunthorpe April 9, 2024, 5:24 p.m. UTC | #7
On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:

> > I saw, I am saying this should not be done. You cannot unmap bits of
> > a sgl mapping if an invalidation comes in.
> 
> You are right, if we register a huge mmu interval notifier to cover
> the whole address space, then we should use dma map/unmap pages to
> map bits of the address space. We will explore this approach.
> 
> Right now, in xe driver, mmu interval notifier is dynamically
> registered with small address range. We map/unmap the whole small
> address ranges each time. So functionally it still works. But it
> might not be as performant as the method you said. 

Please don't do this, it is not how hmm_range_fault() should be
used.

It is intended to be page by page and there is no API surface
available to safely try to construct covering ranges. Drivers
definitely should not try to invent such a thing.

> > Please don't use sg list at all for this.
> 
> As explained, we use sg list for device private pages so we can
> re-used the gpu page table update codes. 

I'm asking you not to use SGL lists for that too. SGL lists are not
generic data structures to hold DMA lists.

> > This is not what I'm talking about. The GPU VMA is bound to a specific
> > MM VA, it should not be created on demand.
> 
> Today we have two places where we create gpu vma: 1) create gpu vma
> during a vm_bind ioctl 2) create gpu vma during a page fault of the
> system allocator range (this will be in v2 of this series).

Don't do 2.

> I suspect something dynamic is still necessary, either a vma or a
> page table state (1 vma but many page table state created
> dynamically, as planned in our stage 2). 

I'm expecting you'd populate the page table memory on demand.

> The reason is, we still need some gpu corresponding structure to
> match the cpu vm_area_struct.

Definitely not.

> For example, when gpu page fault happens, you look
> up the cpu vm_area_struct for the fault address and create a
> corresponding state/struct. And people can have as many cpu
> vm_area_struct as they want.

No you don't.

You call hmm_range_fault() and it does everything for you. A driver
should never touch CPU VMAs and must not be aware of them in any way.

Jason
Matthew Brost April 9, 2024, 5:33 p.m. UTC | #8
On Tue, Apr 09, 2024 at 10:45:22AM -0600, Zeng, Oak wrote:

Oak - A few drive-by comments...

> Hi Jason
> 
> We are re-spinning this series based on the previous community feedback. I will send out a v2 soon. There were big changes compared to v1. So we would better to discuss this work with v2. 
> 
> See some reply inline.
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, April 5, 2024 2:02 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> > Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> > Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> > 
> > On Fri, Apr 05, 2024 at 04:42:14PM +0000, Zeng, Oak wrote:
> > > > > Above codes deal with a case where dma map is not needed. As I
> > > > > understand it, whether we need a dma map depends on the devices
> > > > > topology. For example, when device access host memory or another
> > > > > device's memory through pcie, we need dma mapping; if the connection
> > > > > b/t devices is xelink (similar to nvidia's nvlink), all device's
> > > > > memory can be in same address space, so no dma mapping is needed.
> > > >
> > > > Then you call dma_map_page to do your DMA side and you avoid it for
> > > > the DEVICE_PRIVATE side. SG list doesn't help this anyhow.
> > >
> > > When dma map is needed, we used dma_map_sgtable, a different flavor
> > > of the dma-map-page function.
> > 
> > I saw, I am saying this should not be done. You cannot unmap bits of
> > a sgl mapping if an invalidation comes in.
> 
> You are right, if we register a huge mmu interval notifier to cover the whole address space, then we should use dma map/unmap pages to map bits of the address space. We will explore this approach.
> 
> Right now, in xe driver, mmu interval notifier is dynamically registered with small address range. We map/unmap the whole small address ranges each time. So functionally it still works. But it might not be as performant as the method you said. This is existing logic for our userptr codes. Our system allocator inherit this logic automatically as our system allocator design is built on top of userptr (will send out v2 soon ).We plan to make things work in the first stage then do some performance improvement like you suggested here.
>

Agree that reworking the notifier design for the initial phase is probably out of
scope, as that would be a fairly large rework. IMO, if/when we switch from
a 1:1 relationship between VMA and PT state to 1:N, that is
likely when it would make sense to redesign our notifier too.

A 1:N relationship is basically a prerequisite for 1 notifier to work, or at least
a new locking structure (maybe ref counting too?) to be able to safely go
from a large notifier -> invalidation chunk.

> > 
> > > The reason we also used (mis-used) sg list for non-dma-map cases, is
> > > because we want to re-use some data structure. Our goal here is, for
> > > a hmm_range, build a list of device physical address (can be
> > > dma-mapped or not), which will be used later on to program the
> > > device page table. We re-used the sg list structure for the
> > > non-dma-map cases so those two cases can share the same page table
> > > programming codes. Since sg list was originally designed for
> > > dma-map, it does look like this is mis-used here.
> > 
> > Please don't use sg list at all for this.
> 
> As explained, we use sg list for device private pages so we can re-used the gpu page table update codes. The input of the gpu page table update codes in this case is a list of dma address (in case of system memory) or device physical address (in case of device private pages). The gpu page table update codes in xe driver is pretty complicated, so re-use that codes is a preferable thing for us. If we introduce different data structure we would have to re-write part of the gpu page table update codes.
> 

Use the buddy blocks for the xe_res_cursor... We basically already have
this in place, see xe_res_first which takes a struct ttm_resource *res
as an argument and underneath uses buddy blocks for the cursor. 
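
(A rough sketch of what that walk could look like; the cursor field names
below follow the existing xe_res_cursor helpers from memory and may not be
exact, and bo_size stands for the total size of the BO.)

struct xe_res_cursor cur;

/* Sketch: walk the buddy-backed VRAM blocks of a BO and feed each
 * contiguous block into the page table update, no sg list needed.
 */
for (xe_res_first(bo->ttm.resource, 0, bo_size, &cur);
     cur.remaining; xe_res_next(&cur, cur.size)) {
	u64 dpa = cur.start;	/* device offset of this block */
	u64 len = cur.size;

	/* program [dpa, dpa + len) into the GPU page table */
}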

> I don't see an obvious problem of this approach. But if you see this a problem, I am open to change it.
>

This should be trivial to change assuming buddy blocks are stored
somewhere (they must be, right?), so I'd do this right away.

Only got to here; maybe I'll reply a bit more to the comments below later...

Matt
 
> > 
> > > Need to mention, even for some DEVICE_PRIVATE memory, we also need
> > > dma-mapping. For example, if you have two devices connected to each
> > > other through PCIe, both devices memory are registered as
> > > DEVICE_PRIVATE to hmm.
> > 
> > Yes, but you don't ever dma map DEVICE_PRIVATE.
> > 
> > > I believe we need a dma-map when one device access another device's
> > > memory. Two devices' memory doesn't belong to same address space in
> > > this case. For modern GPU with xeLink/nvLink/XGMI, this is not
> > > needed.
> > 
> > Review my emails here:
> > 
> > https://lore.kernel.org/dri-devel/20240403125712.GA1744080@nvidia.com/
> > 
> > Which explain how it should work.
> 
> You are right. Dma map is not needed for device private
> 
> > 
> > > > A 1:1 SVA mapping is a special case of this where there is a single
> > > > GPU VMA that spans the entire process address space with a 1:1 VA (no
> > > > offset).
> > >
> > > From implementation perspective, we can have one device page table
> > > for one process for such 1:1 va mapping, but it is not necessary to
> > > have a single gpu vma. We can have many gpu vma each cover a segment
> > > of this address space.
> > 
> > This is not what I'm talking about. The GPU VMA is bound to a specific
> > MM VA, it should not be created on demand.
> 
> Today we have two places where we create gpu vma: 1) create gpu vma during a vm_bind ioctl 2) create gpu vma during a page fault of the system allocator range (this will be in v2 of this series).
> 
> So for case 2), we do create gpu vma on demand. We also plan to explore doing this differently, such as only keep a gigantic gpu vma covering the whole address space for system allocator while only create some gpu page table state during page fault handling. This is planned for stage 2.
> 
> > 
> > If you want the full 1:1 SVA case to optimize invalidations you don't
> > need something like a VMA, 
> 
> A gpu vma (xe_vma struct in xe driver codes) is a very fundamental concept in our driver. Leveraging vma can re-use a lot of existing driver codes such as page table update.
> 
> But you are right, strictly speaking we don't need a vma. Actually in this v1 version I sent out, we don't have a gpu vma concept for system allocator. But we do have a svm range concept. We created a temporary vma for page table update purpose. Again this design will be obsolete in v2 - in v2 system allocator leverage userptr codes which incorporate with gpu vma. 
> 
> 
> > a simple bitmap reducing the address space
> > into 1024 faulted in chunks or something would be much cheaper than
> > some dynamic VMA ranges.
> 
> 
> I suspect something dynamic is still necessary, either a vma or a page table state (1 vma but many page table state created dynamically, as planned in our stage 2). The reason is, we still need some gpu corresponding structure to match the cpu vm_area_struct. For example, when gpu page fault happens, you look up the cpu vm_area_struct for the fault address and create a corresponding state/struct. And people can have as many cpu vm_area_struct as they want.
> 
> Oak  
> 
> > 
> > Jason
Oak Zeng April 23, 2024, 9:17 p.m. UTC | #9
Hi Jason,

Sorry for the late reply. I have been working on a v2 of this series: https://patchwork.freedesktop.org/series/132229/. This version addresses some of your concerns, such as removing the global character device and removing the svm process concept (needs further cleanup per Matt's feedback).

But the main concern you raised is not addressed yet. I need to make sure I fully understand your concerns; see inline.



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 9, 2024 1:24 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost, Matthew
> <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> hmm range
> 
> On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> 
> > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > a sgl mapping if an invalidation comes in.
> >
> > You are right, if we register a huge mmu interval notifier to cover
> > the whole address space, then we should use dma map/unmap pages to
> > map bits of the address space. We will explore this approach.
> >
> > Right now, in xe driver, mmu interval notifier is dynamically
> > registered with small address range. We map/unmap the whole small
> > address ranges each time. So functionally it still works. But it
> > might not be as performant as the method you said.
> 
> Please don't do this, it is not how hmm_range_fault() should be
> used.
> 
> It is intended to be page by page and there is no API surface
> available to safely try to construct covering ranges. Drivers
> definately should not try to invent such a thing.

I need your help to understand this comment. Our gpu mirrors the whole CPU virtual address space. It is the first design pattern in your previous reply (the entire exclusive owner of a single device private page table, fully mirroring the mm page table into the device table).

What do you mean by "page by page" / "no API surface available to safely try to construct covering ranges"? As I understand it, hmm_range_fault takes a virtual address range (defined in the hmm_range struct) and walks the cpu page table over this range. It is a range-based API.

From your previous reply ("So I find it quite strange that this RFC is creating VMAs, notifiers and ranges on the fly"), it seems you are questioning why we are creating vmas and registering mmu interval notifiers on the fly. Let me try to explain it. Xe_vma is a very fundamental concept in the xe driver. The gpu page table updates and invalidations are all vma-based. This concept existed before this svm work. For svm, we create a 2M vma (the size is user configurable) in the gpu page fault handler and register this 2M range with an mmu interval notifier.

Now I am trying to figure out: if we don't create a vma, what can we do? We can map one page (the one that contains the gpu fault address) into the gpu page table. But that doesn't work for us because the GPU cache and TLB would not be performant mapping a 4K page each time. One way to think of the vma is as a chunk size that is good for GPU HW performance.

And the mmu notifier... if we don't register mmu notifiers on the fly, do we register one mmu notifier to cover the whole CPU virtual address space (which would be huge, e.g., 0~2^56 on a 57-bit machine with a half/half user space/kernel space split)? That wouldn't be performant either, because for any address range that is unmapped from the cpu program, even if it is never touched by the GPU, the gpu driver still gets an invalidation callback. In our approach, we only register an mmu notifier for an address range that we know the gpu will touch.
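
(For reference, the on-the-fly registration described above boils down to something like the sketch below; the notifier object, the ops table name and the alignment are illustrative, not the exact xe code.)

u64 start = ALIGN_DOWN(fault_addr, SZ_2M);
int err;

/* Sketch: cover only the (user configurable, here 2M) chunk around the
 * faulting address, rather than one notifier over the whole process
 * address space.
 */
err = mmu_interval_notifier_insert(&svm_notifier, mm, start, SZ_2M,
				   &svm_notifier_ops);
if (err)
	return err;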

> 
> > > Please don't use sg list at all for this.
> >
> > As explained, we use sg list for device private pages so we can
> > re-used the gpu page table update codes.
> 
> I'm asking you not to use SGL lists for that too. SGL lists are not
> generic data structures to hold DMA lists.

Matt mentioned using drm_buddy_block. I will see how that works out.

> 
> > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > MM VA, it should not be created on demand.
> >
> > Today we have two places where we create gpu vma: 1) create gpu vma
> > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > system allocator range (this will be in v2 of this series).
> 
> Don't do 2.

As said, we will try the approach of one gigantic gpu vma with N page table states. We will create page table states in page fault handling. But this is only planned for stage 2. 

> 
> > I suspect something dynamic is still necessary, either a vma or a
> > page table state (1 vma but many page table state created
> > dynamically, as planned in our stage 2).
> 
> I'm expecting you'd populate the page table memory on demand.

We do populate the gpu page table on demand. When the gpu accesses a virtual address, we populate the gpu page table.


> 
> > The reason is, we still need some gpu corresponding structure to
> > match the cpu vm_area_struct.
> 
> Definately not.

See explanation above.

> 
> > For example, when gpu page fault happens, you look
> > up the cpu vm_area_struct for the fault address and create a
> > corresponding state/struct. And people can have as many cpu
> > vm_area_struct as they want.
> 
> No you don't.

See the explanation above. I need your help to understand how we can do it without a vma (or chunk). One thing that makes a GPU driver different from an RDMA driver is that RDMA doesn't have device private memory, so there is no migration. It only needs to dma-map the system memory pages and use them to fill the RDMA page table, so RDMA doesn't need another memory manager such as our buddy allocator. RDMA only deals with system memory, which is completely struct-page-based management. Page by page makes 100% sense for RDMA.

But for the gpu, we need a way to use device local memory efficiently. This is the main reason we have the vma/chunk concept.

Thanks,
Oak


> 
> You call hmm_range_fault() and it does everything for you. A driver
> should never touch CPU VMAs and must not be aware of them in any away.
> 
> Jason
Matthew Brost April 24, 2024, 2:31 a.m. UTC | #10
On Tue, Apr 23, 2024 at 03:17:03PM -0600, Zeng, Oak wrote:
> Hi Jason,
> 
> Sorry for a late reply. I have been working on a v2 of this series: https://patchwork.freedesktop.org/series/132229/. This version addressed some of your concerns, such as removing the global character device, removing svm process concept (need further clean up per Matt's feedback).
> 
> But the main concern you raised is not addressed yet. I need to further make sure I understand your concerns, See inline.
> 

A few extra comments with references below.

> 
> 
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 9, 2024 1:24 PM
> > To: Zeng, Oak <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost, Matthew
> > <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com; Welty, Brian
> > <brian.welty@intel.com>; Ghimiray, Himal Prasad
> > <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> > <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> > <niranjana.vishwanathapura@intel.com>; Leon Romanovsky <leon@kernel.org>
> > Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from
> > hmm range
> > 
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > 
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in xe driver, mmu interval notifier is dynamically
> > > registered with small address range. We map/unmap the whole small
> > > address ranges each time. So functionally it still works. But it
> > > might not be as performant as the method you said.
> > 
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> > 
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definately should not try to invent such a thing.
>
> I need your help to understand this comment. Our gpu mirrors the whole CPU virtual address space. It is the first design pattern in your previous reply (entire exclusive owner of a single device private page table and fully mirrors the mm page table into the device table.) 
> 
> What do you mean by "page by page"/" no API surface available to safely try to construct covering ranges"? As I understand it, hmm_range_fault take a virtual address range (defined in hmm_range struct), and walk cpu page table in this range. It is a range based API.
> 
> From your previous reply ("So I find it a quite strange that this RFC is creating VMA's, notifiers and ranges on the fly "), it seems you are concerning why we are creating vma and register mmu interval notifier on the fly. Let me try to explain it. Xe_vma is a very fundamental concept in xe driver. The gpu page table update, invalidation are all vma-based. This concept exists before this svm work. For svm, we create a 2M (the size is user configurable) vma during gpu page fault handler and register this 2M range to mmu interval notifier.
> 
> Now I try to figure out if we don't create vma, what can we do? We can map one page (which contains the gpu fault address) to gpu page table. But that doesn't work for us because the GPU cache and TLB would not be performant for 4K page each time. One way to think of the vma is a chunk size which is good for GPU HW performance.
> 
> And the mmu notifier... if we don't register the mmu notifier on the fly, do we register one mmu notifier to cover the whole CPU virtual address space (which would be huge, e.g., 0~2^56 on a 57 bit machine, if we have half half user space kernel space split)? That won't be performant as well because for any address range that is unmapped from cpu program, but even if they are never touched by GPU, gpu driver still got a invalidation callback. In our approach, we only register a mmu notifier for address range that we know gpu would touch it. 
>

AMD seems to register notifiers on demand for parts of the address space
[1], and I think Nvidia's open source driver does this too (I can look
this up if needed). We (Intel) also do this in Xe and the i915 for userptrs
(explicitly binding a user address via IOCTL) and it seems to work
quite well.

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c#L130
 
> > 
> > > > Please don't use sg list at all for this.
> > >
> > > As explained, we use sg list for device private pages so we can
> > > re-used the gpu page table update codes.
> > 
> > I'm asking you not to use SGL lists for that too. SGL lists are not
> > generic data structures to hold DMA lists.
> 
> Matt mentioned to use drm_buddy_block. I will see how that works out.
> 

Probably actually build an iterator (xe_res_cursor) for the device pages
returned from hmm_range_fault, now that I think about this more.

> > 
> > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > MM VA, it should not be created on demand.
> > >
> > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > system allocator range (this will be in v2 of this series).
> > 
> > Don't do 2.

You have to create something, actually 2 things, on a GPU page fault.
Something to track the page table state and something to track VRAM
memory allocation. Both AMD and Nvidia's open source driver do this.

In AMD's driver the page table state is svm_range [2] and VRAM state is
svm_range_bo [3]. 

Nvidia's open source driver also does something similar (again can track
down a ref if needed).

Conceptually Xe will do something similar; these are trending towards
an xe_vma and an xe_bo respectively. The exact details are TBD but the
concept is solid.

[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L109
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_svm.h#L42 

> 
> As said, we will try the approach of one gigantic gpu vma with N page table states. We will create page table states in page fault handling. But this is only planned for stage 2. 
> 
> > 
> > > I suspect something dynamic is still necessary, either a vma or a
> > > page table state (1 vma but many page table state created
> > > dynamically, as planned in our stage 2).
> > 
> > I'm expecting you'd populate the page table memory on demand.
> 
> We do populate gpu page table on demand. When gpu access a virtual address, we populate the gpu page table.
> 
> 
> > 
> > > The reason is, we still need some gpu corresponding structure to
> > > match the cpu vm_area_struct.
> > 
> > Definately not.
> 
> See explanation above.
> 

Agree the GPU doesn't need to match vm_area_struct, but the allocation must
be a subset of (or equal to) a vm_area_struct. Again, other drivers do this
too.

e.g. You can't allocate a 2MB chunk if the vm_area_struct looked up is
only 64k.

> > 
> > > For example, when gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_struct as they want.
> > 
> > No you don't.

Yes you do. See below.

> 
> See explains above. I need your help to understand how we can do it without a vma (or chunk). One thing GPU driver is different from RDMA driver is, RDMA doesn't have device private memory so no migration. It only need to dma-map the system memory pages and use them to fill RDMA page table. so RDMA don't need another memory manager such as our buddy. RDMA only deal with system memory which is completely struct page based management. Page by page make 100 % sense for RDMA. 
> 
> But for gpu, we need a way to use device local memory efficiently. This is the main reason we have vma/chunk concept.
> 
> Thanks,
> Oak
> 
> 
> > 
> > You call hmm_range_fault() and it does everything for you. A driver
> > should never touch CPU VMAs and must not be aware of them in any away.
> > 

struct vm_area_struct is an argument to the migrate_vma* functions [4], so
yes drivers need to be aware of CPU VMAs.

Again AMD [5], Nouveau [6], and Nvidia's open source driver (again no
ref, but I can dig one up) all look up CPU VMAs on a GPU page fault or SVM
bind IOCTL.

[4] https://elixir.bootlin.com/linux/latest/source/include/linux/migrate.h#L186
[5] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c#L522
[6] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_svm.c#L182

Matt

> > Jason
Jason Gunthorpe April 24, 2024, 1:48 p.m. UTC | #11
On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > 
> > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > a sgl mapping if an invalidation comes in.
> > >
> > > You are right, if we register a huge mmu interval notifier to cover
> > > the whole address space, then we should use dma map/unmap pages to
> > > map bits of the address space. We will explore this approach.
> > >
> > > Right now, in xe driver, mmu interval notifier is dynamically
> > > registered with small address range. We map/unmap the whole small
> > > address ranges each time. So functionally it still works. But it
> > > might not be as performant as the method you said.
> > 
> > Please don't do this, it is not how hmm_range_fault() should be
> > used.
> > 
> > It is intended to be page by page and there is no API surface
> > available to safely try to construct covering ranges. Drivers
> > definately should not try to invent such a thing.
> 
> I need your help to understand this comment. Our gpu mirrors the
> whole CPU virtual address space. It is the first design pattern in
> your previous reply (entire exclusive owner of a single device
> private page table and fully mirrors the mm page table into the
> device table.)
> 
> What do you mean by "page by page"/" no API surface available to
> safely try to construct covering ranges"? As I understand it,
> hmm_range_fault take a virtual address range (defined in hmm_range
> struct), and walk cpu page table in this range. It is a range based
> API.

Yes, but invalidation is not linked to range_fault so you can get
invalidations of single pages. You are binding range_fault to
dma_map_sg but then you can't handle invalidation at all sanely.

> From your previous reply ("So I find it a quite strange that this
> RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> you are concerning why we are creating vma and register mmu interval
> notifier on the fly. Let me try to explain it. Xe_vma is a very
> fundamental concept in xe driver. 

I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
and you do need to ensure the page table manipulation has an API that
is usable. "populate an entire VMA" / "invalidate an entire VMA" is
not a sufficient primitive.

> The gpu page table update, invalidation are all vma-based. This
> concept exists before this svm work. For svm, we create a 2M (the
> size is user configurable) vma during gpu page fault handler and
> register this 2M range to mmu interval notifier.

You can create VMAs, but they can't be fully populated and they should
be a lot bigger than 2M. ODP uses a similar 2 level scheme for its
SVA; the "vma" granule is 512M.

The key thing is the VMA is just a container for the notifier and
other data structures; it doesn't insist the range be fully populated
and it must support page-by-page unmap/remap/invalidation.

> Now I try to figure out if we don't create vma, what can we do? We
> can map one page (which contains the gpu fault address) to gpu page
> table. But that doesn't work for us because the GPU cache and TLB
> would not be performant for 4K page each time. One way to think of
> the vma is a chunk size which is good for GPU HW performance.

From this perspective invalidation is driven by the range the
invalidation callback gets, it could be a single page, but probably
bigger.

mapping is driven by the range passed to hmm_range_fault() during
fault handling, which is entirely based on your driver's prefetch
logic.
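
For reference, the canonical hmm_range_fault() usage pattern (as described
in Documentation/mm/hmm.rst) is roughly the sketch below; the range to fault
is whatever the driver's prefetch logic picks, and the xe_* names and
pt_lock are placeholders rather than actual driver code:

static int xe_svm_populate_range(struct xe_svm *svm,
				 struct mmu_interval_notifier *notifier,
				 unsigned long start, unsigned long end,
				 unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	struct mm_struct *mm = notifier->mm;
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;	/* collided with an invalidation */
		return ret;
	}

	/* The same lock the invalidation callback takes */
	mutex_lock(&svm->pt_lock);
	if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
		mutex_unlock(&svm->pt_lock);
		goto again;
	}
	/* program GPU PTEs page by page from range.hmm_pfns[] */
	mutex_unlock(&svm->pt_lock);
	return 0;
}

Nothing in this loop cares about a gpu vma; the only inputs are the range
to fault and the notifier used for the sequence check.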

GPU TLB invalidation sequences should happen once, at the end, for any
invalidation or range_fault sequence regardless. Nothing about "gpu
vmas" should have anything to do with this.

> And the mmu notifier... if we don't register the mmu notifier on the
> fly, do we register one mmu notifier to cover the whole CPU virtual
> address space (which would be huge, e.g., 0~2^56 on a 57 bit
> machine, if we have half half user space kernel space split)? That
> won't be performant as well because for any address range that is
> unmapped from cpu program, but even if they are never touched by
> GPU, gpu driver still got a invalidation callback. In our approach,
> we only register a mmu notifier for address range that we know gpu
> would touch it.

When SVA is used, something, somewhere, has to decide if a CPU VA
intersects with a HW VA.

The mmu notifiers are organized in an interval (red/black) tree, so if
you have a huge number of them the RB search becomes very expensive.

Conversely, your GPU page table is organized in a radix tree, so
detecting non-presence during invalidation is a simple radix
walk. Indeed for the obviously unused ranges it is probably a pointer
load and a single de-ref to check it.

I fully expect the radix walk is much, much faster than a huge number
of 2M notifiers in the red/black tree.

Notifiers for SVA cases should be giant. If not the entire memory
space, then at least something like 512M/1G kind of size, neatly
aligned with something in your page table structure so the radix walk
can start lower in the tree automatically.
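
A rough sketch of what that looks like with the mmu_interval notifier API
(xe_svm, the callback name and svm->notifier are hypothetical, and
TASK_SIZE could be replaced by a smaller aligned region):

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq);

static const struct mmu_interval_notifier_ops xe_svm_notifier_ops = {
	.invalidate = xe_svm_invalidate,
};

static int xe_svm_bind_mm(struct xe_svm *svm, struct mm_struct *mm)
{
	/*
	 * One giant notifier for the whole mirrored space; invalidations
	 * for ranges that were never mirrored are filtered out by a cheap
	 * walk of the GPU page table (or a presence bitmap) in the
	 * invalidate callback.
	 */
	return mmu_interval_notifier_insert(&svm->notifier, mm, 0,
					    TASK_SIZE, &xe_svm_notifier_ops);
}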

> > > For example, when gpu page fault happens, you look
> > > up the cpu vm_area_struct for the fault address and create a
> > > corresponding state/struct. And people can have as many cpu
> > > vm_area_struct as they want.
> > 
> > No you don't.
> 
> See explains above. I need your help to understand how we can do it
> without a vma (or chunk). One thing GPU driver is different from
> RDMA driver is, RDMA doesn't have device private memory so no
> migration. 

I want to be very clear: there should be no interaction of your
hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
the CPU vm_area_struct in your driver.

> It only need to dma-map the system memory pages and use
> them to fill RDMA page table. so RDMA don't need another memory
> manager such as our buddy. RDMA only deal with system memory which
> is completely struct page based management. Page by page make 100 %
> sense for RDMA.

I don't think this is the issue at hand; you just have some historical,
poorly designed page table manipulation code, from what I can
understand.

Jason
Jason Gunthorpe April 24, 2024, 1:57 p.m. UTC | #12
On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:

> AMD seems to register notifiers on demand for parts of the address space
> [1], I think Nvidia's open source driver does this too (can look this up
> if needed). We (Intel) also do this in Xe and the i915 for userptrs
> (explictly binding a user address via IOCTL) too and it seems to work
> quite well.

I always thought AMD's implementation of this stuff was bad..

> > > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > > MM VA, it should not be created on demand.
> > > >
> > > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > > system allocator range (this will be in v2 of this series).
> > > 
> > > Don't do 2.
> 
> You have to create something, actually 2 things, on a GPU page fault.
> Something to track the page table state and something to track VRAM
> memory allocation. Both AMD and Nvidia's open source driver do this.

VRAM memory allocation should be tracked by the mm side, under the
covers of hmm_range_fault (or by migration prior to invoking
hmm_range_fault).

VRAM memory allocation or management has nothing to do with SVA.

From there the only need is to copy hmm_range_fault results into GPU
PTEs. You definitely do not *need* some other data structure.

> > > > The reason is, we still need some gpu corresponding structure to
> > > > match the cpu vm_area_struct.
> > > 
> > > Definately not.
> > 
> > See explanation above.
> 
> Agree GPU doesn't need to match vm_area_struct but the allocation must
> be subset (or equal) to a vm_area_struct. Again other driver do this
> too.

No, absolutely not. There can be no linking of the CPU vm_area_struct to
how a driver operates hmm_range_fault().

You probably need to do something like this for your migration logic,
but that is separate.

> > > You call hmm_range_fault() and it does everything for you. A driver
> > > should never touch CPU VMAs and must not be aware of them in any away.
> 
> struct vm_area_struct is an argument to the migrate_vma* functions [4], so
> yes drivers need to be aware of CPU VMAs.

That is something else. If you want to mess with migration during your
GPU fault path then fine that is some "migration module", but it
should have NOTHING to do with how hmm_range_fault() is invoked or how
the *SVA* flow operates.

You are mixing things up here, this thread is talking about
hmm_range_fault() and SVA.

migration is something that happens before doing the SVA mirroring
flows.

Jason
Matthew Brost April 24, 2024, 4:35 p.m. UTC | #13
On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> 
> > AMD seems to register notifiers on demand for parts of the address space
> > [1], I think Nvidia's open source driver does this too (can look this up
> > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > (explictly binding a user address via IOCTL) too and it seems to work
> > quite well.
> 
> I always thought AMD's implementation of this stuff was bad..
> 

No comment on the quality of AMD's implementation.

But in general the view among my team members is that registering notifiers
on demand for sub ranges is an accepted practice. If we find the perf
for this is terrible, it would be fairly easy to switch to 1 large
notifier mirroring the entire CPU address space. This is a small design
detail IMO.

> > > > > > This is not what I'm talking about. The GPU VMA is bound to a specific
> > > > > > MM VA, it should not be created on demand.
> > > > >
> > > > > Today we have two places where we create gpu vma: 1) create gpu vma
> > > > > during a vm_bind ioctl 2) create gpu vma during a page fault of the
> > > > > system allocator range (this will be in v2 of this series).
> > > > 
> > > > Don't do 2.
> > 
> > You have to create something, actually 2 things, on a GPU page fault.
> > Something to track the page table state and something to track VRAM
> > memory allocation. Both AMD and Nvidia's open source driver do this.
> 
> VRAM memory allocation should be tracked by the mm side, under the
> covers of hmm_range_fault (or migration prior to invoke
> hmm_range_fault).
> 

Yes, the step I describe above is optionally done *before* calling
hmm_range_fault.

> VRAM memory allocation or management has nothing to do with SVA.
> 
> From there the only need is to copy hmm_range_fault results into GPU
> PTEs. You definately do not *need* some other data structure.
> 

You do not *need* some other data structure, as you could always just
walk the page tables, but in practice a data structure exists in a tree
of sorts with the key being a VA range. The data structure has metadata
about the mapping, and all GPU drivers seem to have this. This data
structure, along with the pages returned from hmm_range_fault, is used to
program the GPU PTEs.

Again, the allocation of this data structure happens *before* calling
hmm_range_fault on the first GPU fault within an unmapped range.

> > > > > The reason is, we still need some gpu corresponding structure to
> > > > > match the cpu vm_area_struct.
> > > > 
> > > > Definately not.
> > > 
> > > See explanation above.
> > 
> > Agree GPU doesn't need to match vm_area_struct but the allocation must
> > be subset (or equal) to a vm_area_struct. Again other driver do this
> > too.
> 
> No, absolutely not. There can be no linking of CPU vma_area_struct to
> how a driver operates hmm_range_fault().
>

Agree. Once we are calling hmm_range_fault, vm_area_struct is out of the
picture.

We also will never store a vm_area_struct in our driver; it is only
looked up on demand when required for migration.
 
> You probably need to do something like this for your migration logic,
> but that is seperate.
> 

Yes.

> > > > You call hmm_range_fault() and it does everything for you. A driver
> > > > should never touch CPU VMAs and must not be aware of them in any away.
> > 
> > struct vm_area_struct is an argument to the migrate_vma* functions [4], so
> > yes drivers need to be aware of CPU VMAs.
> 
> That is something else. If you want to mess with migration during your
> GPU fault path then fine that is some "migration module", but it
> should have NOTHING to do with how hmm_range_fault() is invoked or how
> the *SVA* flow operates.
>

Agree. hmm_range_fault is invoked in an opaque way whether the backing store
is SRAM or VRAM and flows in a uniform way. The only difference is how we
resolve the hmm pfn to a value in the GPU page tables (device private
pages resolve to device physical addresses, while CPU pages are dma mapped).
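
A minimal sketch of that resolve step, assuming a hypothetical
xe_page_to_vram_addr() helper for device-private pages and per-page dma
mapping for system pages:

static int xe_svm_resolve_pfn(struct device *dev, unsigned long hmm_pfn,
			      u64 *gpu_addr, bool *is_vram)
{
	struct page *page;
	dma_addr_t dma;

	if (!(hmm_pfn & HMM_PFN_VALID))
		return -EFAULT;

	page = hmm_pfn_to_page(hmm_pfn);

	if (is_device_private_page(page)) {
		/* VRAM: the PTE holds a device physical address */
		*gpu_addr = xe_page_to_vram_addr(page);
		*is_vram = true;
		return 0;
	}

	/* System memory: dma-map this single page */
	dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, dma))
		return -EIO;
	*gpu_addr = dma;
	*is_vram = false;
	return 0;
}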

> You are mixing things up here, this thread is talking about
> hmm_range_fault() and SVA.
>

I figure there was a level of confusion in this thread. I think we are
now aligned?

Thanks for your time.
Matt

> migration is something that happens before doing the SVA mirroring
> flows.
> 
> Jason
Jason Gunthorpe April 24, 2024, 4:44 p.m. UTC | #14
On Wed, Apr 24, 2024 at 04:35:17PM +0000, Matthew Brost wrote:
> On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> > 
> > > AMD seems to register notifiers on demand for parts of the address space
> > > [1], I think Nvidia's open source driver does this too (can look this up
> > > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > > (explictly binding a user address via IOCTL) too and it seems to work
> > > quite well.
> > 
> > I always thought AMD's implementation of this stuff was bad..
> 
> No comment on the quality of AMD's implementaion.
> 
> But in general the view among my team members that registering notifiers
> on demand for sub ranges is an accepted practice.

Yes, but not on a 2M granule, and not without sparsity. Do it on
something like an aligned 512M and it would be fairly reasonable.

> You do not *need* some other data structure as you could always just
> walk the page tables but in practice a data structure exists in a tree
> of shorts with the key being a VA range. The data structure has meta
> data about the mapping, all GPU drivers seem to have this. 

What "meta data" is there for a SVA mapping? The entire page table is
an SVA.

> structure, along with pages returned from hmm_range_fault, are used to
> program the GPU PTEs.

Most likely pages returned from hmm_range_fault() can just be stored
directly in the page table's PTEs. I'd be surprised if you actually
need separate storage (ignoring some of the current issues with the
DMA API).

> Again the allocation of this data structure happens *before* calling
> hmm_range_fault on first GPU fault within unmapped range.

The SVA page table and hmm_range_fault are tightly connected; if a vma
is needed to make it work then it is not "before", it is part of it.

Jason
Matthew Brost April 24, 2024, 4:56 p.m. UTC | #15
On Wed, Apr 24, 2024 at 01:44:11PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 04:35:17PM +0000, Matthew Brost wrote:
> > On Wed, Apr 24, 2024 at 10:57:54AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 24, 2024 at 02:31:36AM +0000, Matthew Brost wrote:
> > > 
> > > > AMD seems to register notifiers on demand for parts of the address space
> > > > [1], I think Nvidia's open source driver does this too (can look this up
> > > > if needed). We (Intel) also do this in Xe and the i915 for userptrs
> > > > (explictly binding a user address via IOCTL) too and it seems to work
> > > > quite well.
> > > 
> > > I always thought AMD's implementation of this stuff was bad..
> > 
> > No comment on the quality of AMD's implementaion.
> > 
> > But in general the view among my team members that registering notifiers
> > on demand for sub ranges is an accepted practice.
> 
> Yes, but not on a 2M granual, and not without sparsity. Do it on
> something like an aligned 512M and it would be fairly reasonable.
> 
> > You do not *need* some other data structure as you could always just
> > walk the page tables but in practice a data structure exists in a tree
> > of shorts with the key being a VA range. The data structure has meta
> > data about the mapping, all GPU drivers seem to have this. 
> 
> What "meta data" is there for a SVA mapping? The entire page table is
> an SVA.
> 

Whether we have allocated memory for GPU page tables in the range, whether
the range has been invalidated, the notifier seqno, what type of mapping
this is (SVA, BO, userptr, NULL)...

The "meta data" covers all types of mappings, not just SVA. SVA is a
specific class of the "meta data".

> > structure, along with pages returned from hmm_range_fault, are used to
> > program the GPU PTEs.
> 
> Most likely pages returned from hmm_range_fault() can just be stored
> directly in the page table's PTEs. I'd be surprised if you actually
> need seperate storage. (ignoring some of the current issues with the
> DMA API)
> 

In theory that could work, but in practice this is not how it is done.
The "meta data" covers all the mapping classes mentioned above. Our PTE
programming code needs to handle all the different requirements of
these specific classes in a single code path.

> > Again the allocation of this data structure happens *before* calling
> > hmm_range_fault on first GPU fault within unmapped range.
> 
> The SVA page table and hmm_range_fault are tightly connected together,
> if a vma is needed to make it work then it is not "before", it is
> part of.
>

It is companion data for the GPU page table walk. See the above
explanation.

Matt
 
> Jason
Jason Gunthorpe April 24, 2024, 5:48 p.m. UTC | #16
On Wed, Apr 24, 2024 at 04:56:57PM +0000, Matthew Brost wrote:
> > What "meta data" is there for a SVA mapping? The entire page table is
> > an SVA.
> 
> If we have allocated memory for GPU page tables in the range,

This is encoded directly in the radix tree.

> if range
> has been invalidated, 

As I keep saying, these ranges need sparsity, so you have to store
per-page in the radix tree whether it is invalidated. There should be no
need for per-range tracking.

> notifier seqno, 

? The mmu notifier infrastructure handles seqno. You need a mmu
notifier someplace that covers that SVA, or fractionally covers it,
but that is not tied to the PTEs.

> what type of mapping this is (SVA,
> BO, userptr, NULL)...

Which is back to a whole "SVA" type "vma" that covers the entire GPU
page table when you bind the mm in the first place.

> > > structure, along with pages returned from hmm_range_fault, are used to
> > > program the GPU PTEs.
> > 
> > Most likely pages returned from hmm_range_fault() can just be stored
> > directly in the page table's PTEs. I'd be surprised if you actually
> > need seperate storage. (ignoring some of the current issues with the
> > DMA API)
> In theory that could work but again practice this not how it is done.
> The "meta data" cover all the classes mapping mentioned above. Our PTE
> programming code needs to be handle all the different requirements of
> these specific classes in a single code path.

Which is back to my first thesis, this is all about trying to bolt on
an existing PTE scheme which is ill suited to the needs of SVA and
hmm_range_fault.

Jason
Oak Zeng April 24, 2024, 11:59 p.m. UTC | #17
Hi Jason,

I went through the conversation between you and Matt. I think we are pretty much aligned. Here is what I take from this thread:

1) hmm range fault size / gpu page table map size: you prefer a bigger gpu vma size, with the vma sparsely mapped to the gpu. Our vma size is configurable through a user madvise api. We do plan to try 1 gigantic vma with sparse mapping. That requires us to restructure the driver for a 1 vma : N page table state mapping. This will be stage 2 work.

2) invalidation: you prefer a giant notifier. We can consider this if it turns out our implementation is not performant. Currently we don't know.

3) whether the driver can look up the cpu vma. I think we need this for data migration purposes.


See also comments inline.


> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 9:49 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Thomas.Hellstrom@linux.intel.com;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Tue, Apr 23, 2024 at 09:17:03PM +0000, Zeng, Oak wrote:
> > > On Tue, Apr 09, 2024 at 04:45:22PM +0000, Zeng, Oak wrote:
> > >
> > > > > I saw, I am saying this should not be done. You cannot unmap bits of
> > > > > a sgl mapping if an invalidation comes in.
> > > >
> > > > You are right, if we register a huge mmu interval notifier to cover
> > > > the whole address space, then we should use dma map/unmap pages
> to
> > > > map bits of the address space. We will explore this approach.
> > > >
> > > > Right now, in xe driver, mmu interval notifier is dynamically
> > > > registered with small address range. We map/unmap the whole small
> > > > address ranges each time. So functionally it still works. But it
> > > > might not be as performant as the method you said.
> > >
> > > Please don't do this, it is not how hmm_range_fault() should be
> > > used.
> > >
> > > It is intended to be page by page and there is no API surface
> > > available to safely try to construct covering ranges. Drivers
> > > definately should not try to invent such a thing.
> >
> > I need your help to understand this comment. Our gpu mirrors the
> > whole CPU virtual address space. It is the first design pattern in
> > your previous reply (entire exclusive owner of a single device
> > private page table and fully mirrors the mm page table into the
> > device table.)
> >
> > What do you mean by "page by page"/" no API surface available to
> > safely try to construct covering ranges"? As I understand it,
> > hmm_range_fault take a virtual address range (defined in hmm_range
> > struct), and walk cpu page table in this range. It is a range based
> > API.
> 
> Yes, but invalidation is not linked to range_fault so you can get
> invalidations of single pages. You are binding range_fault to
> dma_map_sg but then you can't handle invalidation at all sanely.

Ok, I understand your point now.

Yes, strictly speaking we can get invalidation of a single page. This can be triggered by core mm numa balancing or ksm (kernel samepage merging). At present, my understanding is that single page (or a few pages) invalidation is not a very common case. The more common cases are invalidation triggered by user munmap, or invalidation triggered by hmm migration itself (triggered in migrate_vma_setup). I will experiment with this.

User munmap obviously triggers range-based invalidation.

The invalidation triggered by hmm vma migration is also range-based, as we choose to migrate at vma granularity due to performance considerations, as explained.

I agree that in the case of single page invalidation the current code is not performant, because we invalidate the whole vma. What I can do is look at the mmu_notifier_range parameter of the invalidation callback and only invalidate that range. That requires our driver to split the vma state and page table state. It is a big change. We plan to do it in stage 2.
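
A minimal sketch of such a range-limited invalidation callback (xe_svm,
svm->start/end as the mirrored bounds, and xe_svm_zap_ptes() /
xe_gt_tlb_invalidate() are hypothetical names, not the current driver code):

static bool xe_svm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct xe_svm *svm = container_of(mni, struct xe_svm, notifier);
	unsigned long start = max(range->start, svm->start);
	unsigned long end = min(range->end, svm->end);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&svm->pt_lock);
	else if (!mutex_trylock(&svm->pt_lock))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	if (start < end) {
		/* Zap only the GPU PTEs that overlap the invalidated range */
		xe_svm_zap_ptes(svm, start, end);
		/* One GPU TLB invalidation at the end of the sequence */
		xe_gt_tlb_invalidate(svm);
	}

	mutex_unlock(&svm->pt_lock);
	return true;
}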

> 
> > From your previous reply ("So I find it a quite strange that this
> > RFC is creating VMA's, notifiers and ranges on the fly "), it seems
> > you are concerning why we are creating vma and register mmu interval
> > notifier on the fly. Let me try to explain it. Xe_vma is a very
> > fundamental concept in xe driver.
> 
> I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> and you do need to ensure the page table manipulation has an API that
> is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> not a sufficient primitive.

I understand that invalidating an entire VMA might not be performant. I will improve this as explained above.

I think whether we want to populate an entire VMA or only one page is still the driver's choice. For us, populating a whole VMA (2M bytes by default, but overridable by the user) is still a performant option. If we populate one page at a time, we run into continuous gpu page faults when the gpu accesses the following pages. In most of our compute workloads, the gpu needs to process big chunks of data, e.g., many MiB or even GiB, and the page fault overhead is huge per our measurements.

Do you suggest per-page population? Or do you suggest populating the entire address space or the entire memory region? I did look at the RDMA ODP code. The function ib_umem_odp_map_dma_and_lock also does range-based population. It seems to populate the whole memory region each time, but I am not very sure.

> 
> > The gpu page table update, invalidation are all vma-based. This
> > concept exists before this svm work. For svm, we create a 2M (the
> > size is user configurable) vma during gpu page fault handler and
> > register this 2M range to mmu interval notifier.
> 
> You can create VMAs, but they can't be fully populated and they should
> be alot bigger than 2M. ODP uses a similar 2 level scheme for it's
> SVA, the "vma" granual is 512M.

Oh, I see. So you are suggesting a much bigger granularity. That makes sense to me. Our design actually supports a much bigger granularity: the migration/population granularity is configurable in our design. It is a memory advise API, and one of the advises is called "MIGRATION_GRANULARITY". This part of the code is not in my series yet as it is being worked on by Himal, who is also on this email list. We will publish that work soon for review.

> 
> The key thing is the VMA is just a container for the notifier and
> other data structures, it doesn't insist the range be fully populated
> and it must support page-by-page unmap/remap/invalidateion.

Agreed, and I don't see a hard conflict between our implementation and this concept. The linux core mm vma (struct vm_area_struct) represents an address range in the process's address space. The Xe driver would create xe_vmas to cover sub-regions of a core mm vma. For example, if the core mm vma is 1GiB, we might create xe_vmas of 512MiB or 2MiB, depending on our MIGRATION_GRANULARITY setting.

As explained, we can support page-by-page map/unmap. Our design makes sure we can map/unmap at any page boundary if we want. The granularity setting all depends on performance data.


> 
> > Now I try to figure out if we don't create vma, what can we do? We
> > can map one page (which contains the gpu fault address) to gpu page
> > table. But that doesn't work for us because the GPU cache and TLB
> > would not be performant for 4K page each time. One way to think of
> > the vma is a chunk size which is good for GPU HW performance.
> 
> From this perspective invalidation is driven by the range the
> invalidation callback gets, it could be a single page, but probably
> bigger.

Agree


> 
> mapping is driven by the range passed to hmm_range_fault() during
> fault handling, which is entirely based on your drivers prefetch
> logic.

In our driver, mapping can be triggered by either prefetch or fault.

Prefetch is a user API, so the user can decide the range.

The range used on a fault is decided by the MIGRATION_GRANULARITY user setting. The default value is 2MiB, as said.


> 
> GPU TLB invalidation sequences should happen once, at the end, for any
> invalidation or range_fault sequence regardless. Nothing about "gpu
> vmas" should have anything to do with this.
> 
> > And the mmu notifier... if we don't register the mmu notifier on the
> > fly, do we register one mmu notifier to cover the whole CPU virtual
> > address space (which would be huge, e.g., 0~2^56 on a 57 bit
> > machine, if we have half half user space kernel space split)? That
> > won't be performant as well because for any address range that is
> > unmapped from cpu program, but even if they are never touched by
> > GPU, gpu driver still got a invalidation callback. In our approach,
> > we only register a mmu notifier for address range that we know gpu
> > would touch it.
> 
> When SVA is used something, somewhere, has to decide if a CPU VA
> intersects with a HW VA.
> 
> The mmu notifiers are orginized in an interval (red/black) tree, so if
> you have a huge number of them the RB search becomes very expensive.
> 
> Conversly your GPU page table is organized in a radix tree, so
> detecting no-presence during invalidation is a simple radix
> walk. Indeed for the obviously unused ranges it is probably a pointer
> load and single de-ref to check it.
> 
> I fully expect the radix walk is much, much faster than a huge number
> of 2M notifiers in the red/black tree.
> 
> Notifiers for SVA cases should be giant. If not the entire memory
> space, then at least something like 512M/1G kind of size, neatly
> aligned with something in your page table structure so the radix walk
> can start lower in the tree automatically.

In our implementation, our page table is not organized as a radix tree. Maybe this is an area we can improve. For invalidation, we don't need to walk the page table to figure out which range is present in the page table, since we only register a mmu notifier when a range is actually mapped to the gpu page table. So every notifier callback comes with a valid range in the gpu page table.

I agree that many small 2M notifiers can slow down the red/black tree walk on the core mm side. But with a giant notifier, core mm calls back into the driver many more times than with small notifiers - the driver would be called back even for ranges that are not mapped to the gpu page table.

So I am not sure which approach is faster. But I can experiment with this.


> 
> > > > For example, when gpu page fault happens, you look
> > > > up the cpu vm_area_struct for the fault address and create a
> > > > corresponding state/struct. And people can have as many cpu
> > > > vm_area_struct as they want.
> > >
> > > No you don't.
> >
> > See explains above. I need your help to understand how we can do it
> > without a vma (or chunk). One thing GPU driver is different from
> > RDMA driver is, RDMA doesn't have device private memory so no
> > migration.
> 
> I want to be very clear, there should be no interaction of your
> hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> up
> the CPU vma_area_struct in your driver.

Without looking up the cpu vma, we can't even decide whether our gpu accessed a valid address or not.

When the GPU accesses an address, valid or not (e.g., an out-of-bounds access), the hardware generates a page fault. If we don't look up the cpu vma, how do we determine whether the gpu made an out-of-bounds access?

Furthermore, we call the hmm helper migrate_vma_setup for migration, which takes a struct migrate_vma parameter. migrate_vma has a vma field. If we don't look up the cpu vma, how do we get this field?
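
For what it's worth, the vma lookup in our migration path is roughly the
sketch below, kept separate from the hmm_range_fault mirroring path (the
xe_* names and the src/dst array handling are placeholders):

static int xe_svm_migrate_to_vram(struct xe_svm *svm, struct mm_struct *mm,
				  unsigned long start, unsigned long end,
				  unsigned long *src, unsigned long *dst)
{
	struct vm_area_struct *vma;
	struct migrate_vma migrate = {
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= svm,	/* matches our device pages */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	mmap_read_lock(mm);
	vma = vma_lookup(mm, start);
	if (!vma || end > vma->vm_end) {
		mmap_read_unlock(mm);
		return -EFAULT;	/* e.g. an out-of-bounds gpu access */
	}
	migrate.vma = vma;

	ret = migrate_vma_setup(&migrate);
	if (ret || !migrate.cpages)
		goto out_unlock;

	/* allocate VRAM pages, copy the data, fill migrate.dst[] here */

	migrate_vma_pages(&migrate);
	migrate_vma_finalize(&migrate);
out_unlock:
	mmap_read_unlock(mm);
	return ret;
}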

Oak


> 
> > It only need to dma-map the system memory pages and use
> > them to fill RDMA page table. so RDMA don't need another memory
> > manager such as our buddy. RDMA only deal with system memory which
> > is completely struct page based management. Page by page make 100 %
> > sense for RDMA.
> 
> I don't think this is the issue at hand, you just have some historical
> poorly designed page table manipulation code from what I can
> understand..
> 
> Jason
Jason Gunthorpe April 25, 2024, 1:05 a.m. UTC | #18
On Wed, Apr 24, 2024 at 11:59:18PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I went through the conversation b/t you and Matt. I think we are pretty much aligned. Here is what I get from this threads:
> 
> 1) hmm range fault size, gpu page table map size : you prefer bigger
> gpu vma size and vma can be sparsely mapped to gpu. Our vma size is
> configurable through a user madvise api. 

That is even worse! It is not a user tunable in any way shape or form!

> 2) invalidation: you prefer giant notifier. We can consider this if
> it turns out our implementation is not performant. Currently we
> don't know.

Having many small notifiers is the wrong way to use the API; please
don't use it that way.
 
> 3) whether driver can look up cpu vma. I think we need this for data migration purpose.

The migration code may, but not the SVA/hmm_range_fault code. 

> > > What do you mean by "page by page"/" no API surface available to
> > > safely try to construct covering ranges"? As I understand it,
> > > hmm_range_fault take a virtual address range (defined in hmm_range
> > > struct), and walk cpu page table in this range. It is a range based
> > > API.
> > 
> > Yes, but invalidation is not linked to range_fault so you can get
> > invalidations of single pages. You are binding range_fault to
> > dma_map_sg but then you can't handle invalidation at all sanely.
> 
> Ok, I understand your point now.
> 
> Yes strictly speaking we can get invalidation of a single page. This
> can be triggered by core mm numa balance or ksm (kernel samepage
> merging). At present, my understanding is, single page (or a few
> pages) invalidation is not a very common case. The more common cases
> are invalidation triggered by user munmap, or invalidation triggered
> by hmm migration itself (triggered in migrate_vma_setup). I will
> experiment this.

Regardless, it must be handled, and unmapping an entire 'gpu vma' is a
horrible implementation of HMM.

> I agree in case of single page invalidation, the current codes is
> not performant because we invalidate the whole vma. What I can do
> is, look at the mmu_notifier_range parameter of the invalidation
> callback, and only invalidate the range. The requires our driver to
> split the vma state and page table state. It is a big change. We
> plan to do it in stage 2

Which is, again, continuing to explain why this VMA based approach
is a completely wrong way to use hmm.

> > I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> > and you do need to ensure the page table manipulation has an API that
> > is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> > not a sufficient primitive.
> 
> I understand invalidate entire VMA might be not performant. I will
> improve as explained above.

Please stop saying performant. There are correct ways to use hmm and
bad ways. What you are doing is a bad way. Even if the performance is
only a little different it is still the kind of horrible code
structure I don't want to see in new hmm users.

> I think whether we want to populate entire VMA or only one page is
> still driver's selection. 

hmm_range_fault() should be driven with a prefetch/fault-around
scheme. This has nothing to do with a durable VMA record and addresses
these concerns.

> Do you suggest per page based population? Or do you suggest to
> populate the entire address space or the entire memory region? I did
> look at RDMA odp codes. In function ib_umem_odp_map_dma_and_lock, it
> is also a range based population. It seems it populate the whole
> memory region each time, not very sure.

Performance here is very complicated. You often want to allow
userspace to prefetch data into the GPU page table to warm it up to
avoid page faults, as faults are generally super slow and hard to
manage performantly. ODP has many options to control this at a fine
granularity.

> > You can create VMAs, but they can't be fully populated and they should
> > be alot bigger than 2M. ODP uses a similar 2 level scheme for it's
> > SVA, the "vma" granual is 512M.
> 
> Oh, I see. So you are suggesting a much bigger granularity. That
> make sense to me. Our design actually support a much bigger
> granularity. The migration/population granularity is configurable in
> our design. It is a memory advise API and one of the advise is
> called "MIGRATION_GRANULARITY".

I don't think the notifier granule should be user tunable; you are
actually doing something different - it sounds like it mostly acts as
a prefetch tunable.

> > The key thing is the VMA is just a container for the notifier and
> > other data structures, it doesn't insist the range be fully populated
> > and it must support page-by-page unmap/remap/invalidateion.
> 
> Agree and I don't see a hard conflict of our implementation to this
> concept. So the linux core mm vma (struct vm_area_struct) represent
> an address range in the process's address space. Xe driver would
> create some xe_vma to cover a sub-region of a core mm vma. 

No, again and again, NO. hmm_range_fault users are not, and must not be,
linked to CPU VMAs ever.

> > mapping is driven by the range passed to hmm_range_fault() during
> > fault handling, which is entirely based on your drivers prefetch
> > logic.
> 
> In our driver, mapping can be triggered by either prefetch or fault. 
> 
> Prefetch is a user API so user can decide the range.

> The range used in fault is decided by MIGRATION_GRANULARITY user
> setting. The default value is 2MiB as said.

Which is basically turning it into a fault prefetch tunable.
 
> In our implementation, our page table is not organized as a radix
> tree. 

Huh? What does the HW in the GPU do to figure out how to issue DMAs? Your
GPU HW doesn't have a radix tree walker you are configuring?

> Maybe this an area we can improve. For validation, we don't
> need to walk page table to figure out which range is present in page
> table. Since we only register a mmu notifier when a range is
> actually mapped to gpu page table. So all the notifier callback is
> with a valid range in gpu page table.

But this is not page by page.

> I agree many 2M small notifiers can slow down the red/black tree
> walk from core mm side. But with giant notifier, core mm callback
> driver much more times than small notifier - driver would be called
> back even if a range is not mapped to gpu page table.

The driver should do a cheaper check than a red/black check if it
needs to do any work, e.g. with a typical radix page table, or some kind
of fast sparse bitmap. At a 2M granule and with the 100GB workloads of
these days this is an obviously losing idea.
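
A sketch of what such a cheap filter could look like, purely illustrative
(xe_svm and the populated_bitmap are placeholders), with one bit per 1G
chunk of the mirrored space set whenever any GPU PTE inside that chunk is
populated:

#define XE_SVM_CHUNK_SHIFT	30	/* 1G chunks */

static bool xe_svm_range_populated(struct xe_svm *svm,
				   unsigned long start, unsigned long end)
{
	unsigned long first = start >> XE_SVM_CHUNK_SHIFT;
	unsigned long last = (end - 1) >> XE_SVM_CHUNK_SHIFT;
	unsigned long i;

	for (i = first; i <= last; i++)
		if (test_bit(i, svm->populated_bitmap))
			return true;

	/* Nothing mapped here, the invalidation is a no-op */
	return false;
}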

> > I want to be very clear, there should be no interaction of your
> > hmm_range_fault based code with MM centric VMAs. You MUST NOT look
> > up
> > the CPU vma_area_struct in your driver.
> 
> Without look up cpu vma, we even can't decide whether our gpu
> accessed an valid address or not.

hmm_range_fault provides this for you.
 
> Further more, we call hmm helpers migrate_vma_setup for migration
> which take a struct migrate_vma parameter. Migrate_vma has a vma
> field. If we don't look up cpu vma, how do we get this field?

Migration is a different topic. The vma lookup for a migration has
nothing to do with hmm_range_fault and SVA.

(and perhaps arguably this is structured wrong as you probably want
hmm_range_fault to do the migrations for you as it walks the range to
avoid double walks and things)

Jason
Thomas Hellstrom April 26, 2024, 9:55 a.m. UTC | #19
Hi, Jason.

I've quickly read through the discussion here and have a couple of
questions and clarifications to hopefully help moving forward on
aligning on an approach for this.

Let's for simplicity initially ignore migration and assume this is on
integrated hardware since it seems like it's around the
hmm_range_fault() usage there is a disconnect.

First, the gpu_vma structure is something that partitions the gpu_vm
that holds gpu-related range metadata, like what to mirror, desired gpu
caching policies etc. These are managed (created, removed and split)
mainly from user-space. These are stored and looked up from an rb-tree.

Each such mirroring gpu_vma holds an mmu_interval notifier.

For invalidation only purposes, the mmu_interval seqno is not tracked.
An invalidation thus only zaps page-table entries, causing subsequent
accesses to fault. Hence for this purpose, having a single notifier
that covers a huge range is desirable and does not become a problem.

Now, when we hit a fault, we want to use hmm_range_fault() to re-
populate the faulting PTE, but also to pre-fault a range. Using a range
here (let's call this a prefault range for clarity) rather than
inserting a single PTE is for multiple reasons:

1) avoid subsequent adjacent faults
2a) Using huge GPU page-table entries.
2b) Updating the GPU page-table (tree-based and multi-level) becomes
more efficient when using a range.

Here, depending on hardware, 2a might be more or less crucial for GPU
performance. 2b somewhat ties into 2a but otherwise does not affect gpu
performance.

This is why we've been using dma_map_sg() for these ranges, since it is
assumed the benefits gained from 2) above by far outweigh any benefit
from finer-granularity dma-mappings on the rarer occasion of faults.
Are there other benefits from single-page dma mappings that you think
we need to consider here?

Second, when pre-faulting a range like this, the mmu interval notifier
seqno comes into play, until the gpu ptes for the prefault range are
safely in place. Now if an invalidation happens in a completely
separate part of the mirror range, it will bump the seqno and force us
to rerun the fault processing unnecessarily. Hence, for this purpose we
ideally just want to get a seqno bump covering the prefault range.
That's why finer-granularity mmu_interval notifiers might be beneficial
(and then cached for future re-use of the same prefault range). This
leads me to the next question:

You mention that mmu_notifiers are expensive to register. From looking
at the code it seems *mmu_interval* notifiers are cheap unless there
are ongoing invalidations in which case using a gpu_vma-wide notifier
would block anyway? Could you clarify a bit more the cost involved
here? If we don't register these smaller-range interval notifiers, do
you think the seqno bumps from unrelated subranges would be a real
problem?

Finally the size of the pre-faulting range is something we need to
tune. Currently it is cpu vma - wide. I understand you strongly suggest
this should be avoided. Could you elaborate a bit on why this is such a
bad choice?

Thanks,
Thomas
Jason Gunthorpe April 26, 2024, noon UTC | #20
On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> First, the gpu_vma structure is something that partitions the gpu_vm
> that holds gpu-related range metadata, like what to mirror, desired gpu
> caching policies etc. These are managed (created, removed and split)
> mainly from user-space. These are stored and looked up from an rb-tree.

Except we are talking about SVA here, so all of this should not be
exposed to userspace.

> Now, when we hit a fault, we want to use hmm_range_fault() to re-
> populate the faulting PTE, but also to pre-fault a range. Using a range
> here (let's call this a prefault range for clarity) rather than to
> insert a single PTE is for multiple reasons:

I've never said to do a single range; everyone using hmm_range_fault()
has some kind of prefetch/populate-around algorithm.

> This is why we've been using dma_map_sg() for these ranges, since it is
> assumed the benefits gained from 

This doesn't logically follow. You need to use dma_map_page page by
page and batch that into your update mechanism.

If you use dma_map_sg you get into the world of wrongness where you
have to track ranges and invalidation has to wipe an entire range -
because you cannot do a dma unmap of a single page from a dma_map_sg
mapping. This is all the wrong way to use hmm_range_fault.

hmm_range_fault() is page table mirroring, it fundamentally must be
page-by-page. The target page table structure must have similar
properties to the MM page table - especially page by page
validate/invalidate. Meaning you cannot use dma_map_sg().
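
A minimal sketch of that page-by-page mapping (the function name and the
dma_addrs[] storage are placeholders): each valid hmm_pfn is dma-mapped
individually so a later invalidation can unmap any single page without
tearing down a whole sg mapping:

static int xe_svm_map_pages(struct device *dev, struct hmm_range *range,
			    dma_addr_t *dma_addrs)
{
	unsigned long i, npages = (range->end - range->start) >> PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		struct page *page;

		if (!(range->hmm_pfns[i] & HMM_PFN_VALID)) {
			dma_addrs[i] = 0;
			continue;
		}

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			goto unwind;
	}
	return 0;

unwind:
	while (i--)
		if (dma_addrs[i])
			dma_unmap_page(dev, dma_addrs[i], PAGE_SIZE,
				       DMA_BIDIRECTIONAL);
	return -EIO;
}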

> Second, when pre-faulting a range like this, the mmu interval notifier
> seqno comes into play, until the gpu ptes for the prefault range are
> safely in place. Now if an invalidation happens in a completely
> separate part of the mirror range, it will bump the seqno and force us
> to rerun the fault processing unnecessarily. 

This is how hmm_range_fault() works. Drivers should not do hacky
things to try to "improve" this. SVA granules should be large, maybe
not the entire MM, but still quite large. 2M is far too small.

There is a tradeoff here of slowing down the entire MM vs risking an
iteration during fault processing. We want to err toward making fault
processing slower because fault processing is already really slow.

> Hence, for this purpose we
> ideally just want to get a seqno bump covering the prefault range.

Ideally, but this is not something we can get for free.

> That's why finer-granularity mmu_interval notifiers might be beneficial
> (and then cached for future re-use of the same prefault range). This
> leads me to the next question:

It is not the design; please don't invent crazy special Intel things
on top of hmm_range_fault.

> You mention that mmu_notifiers are expensive to register. From looking
> at the code it seems *mmu_interval* notifiers are cheap unless there
> are ongoing invalidations in which case using a gpu_vma-wide notifier
> would block anyway? Could you clarify a bit more the cost involved
> here?

The rb tree insertions become expensive the larger the tree is. If you
have only a couple of notifiers it is reasonable.

> If we don't register these smaller-range interval notifiers, do
> you think the seqno bumps from unrelated subranges would be a real
> problem?

I don't think it is; you'd need to have a workload which was
aggressively manipulating the CPU mm (which is also pretty slow). If
the workload is doing that then it also really won't like being slowed
down by the giant rb tree.

You can't win with an argument that collisions are likely due to an
app pattern that puts a lot of stress on the MM, so the right response
is to make the MM slower.

> Finally the size of the pre-faulting range is something we need to
> tune. 

Correct.

> Currently it is cpu vma - wide. I understand you strongly suggest
> this should be avoided. Could you elaborate a bit on why this is such a
> bad choice?

Why would a prefetch have anything to do with a VMA? I.e. your app calls
malloc() and gets a little allocation out of a giant mmap() arena -
you want to prefault the entire arena? Does that really make any
sense?

Mirroring is a huge PITA; IMHO it should be discouraged in favour of
SVA. Sadly too many CPUs still cannot implement SVA.

With mirroring there is no good way for the system to predict what the
access pattern is. The only way to make this actually performant is
for userspace to explicitly manage the mirroring with some kind of
prefetching scheme to avoid faulting its accesses except in
extraordinary cases.

VMA is emphatically not a hint about what to prefetch. You should
balance your prefetching based on HW performance and related factors. If
it is tidy for HW to fault around a 2M granule then just do that.
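
A sketch of such a fixed fault-around window, clamped to the mirrored
bounds and with no reference to any CPU VMA (the function name and the
bounds parameters are placeholders):

static void xe_svm_fault_window(unsigned long fault_addr,
				unsigned long mirror_start,
				unsigned long mirror_end,
				unsigned long *start, unsigned long *end)
{
	/* Aligned 2M granule around the faulting address */
	*start = max(ALIGN_DOWN(fault_addr, SZ_2M), mirror_start);
	*end = min(*start + SZ_2M, mirror_end);
}

This range is then what gets handed to hmm_range_fault(); nothing about it
depends on the CPU vm_area_struct layout.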

Jason
Thomas Hellstrom April 26, 2024, 2:49 p.m. UTC | #21
On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > First, the gpu_vma structure is something that partitions the
> > gpu_vm
> > that holds gpu-related range metadata, like what to mirror, desired
> > gpu
> > caching policies etc. These are managed (created, removed and
> > split)
> > mainly from user-space. These are stored and looked up from an rb-
> > tree.
> 
> Except we are talking about SVA here, so all of this should not be
> exposed to userspace.

I think you are misreading. This is at the level of "Mirror this region of
the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
accesses on this region" - very similar to cpu mmap / munmap and
madvise. What I'm trying to say here is that this does not directly
affect the SVA except whether to do SVA or not, and in that case what
region of the CPU mm will be mirrored, and in addition, any gpu
attributes for the mirrored region.

> 
> > Now, when we hit a fault, we want to use hmm_range_fault() to re-
> > populate the faulting PTE, but also to pre-fault a range. Using a
> > range
> > here (let's call this a prefault range for clarity) rather than to
> > insert a single PTE is for multiple reasons:
> 
> I've never said to do a single range, everyone using
> hmm_range_fault()
> has some kind of prefetch populate around algorithm.
> 
> > This is why we've been using dma_map_sg() for these ranges, since
> > it is
> > assumed the benefits gained from 
> 
> This doesn't logically follow. You need to use dma_map_page page by
> page and batch that into your update mechanism.
> 
> If you use dma_map_sg you get into the world of wrongness where you
> have to track ranges and invalidation has to wipe an entire range -
> because you cannot do a dma unmap of a single page from a dma_map_sg
> mapping. This is all the wrong way to use hmm_range_fault.
> 
> hmm_range_fault() is page table mirroring, it fundamentally must be
> page-by-page. The target page table structure must have similar
> properties to the MM page table - especially page by page
> validate/invalidate. Meaning you cannot use dma_map_sg().

To me this is purely an optimization to make the driver page-table and
hence the GPU TLB benefit from iommu coalescing / large pages and large
driver PTEs. It is true that invalidation will sometimes shoot down
large gpu ptes unnecessarily but it will not put any additional burden
on the core AFAICT. For the dma mappings themselves they aren't touched
on invalidation since zapping the gpu PTEs effectively stops any dma
accesses. The dma mappings are rebuilt on the next gpu pagefault,
which, as you mention, is considered slow anyway, but will probably
still reuse the same prefault region, hence needing to rebuild the dma
mappings anyway.

So as long as we are correct and do not adversely affect core mm, and if
gpu performance (for whatever reason) is severely hampered when
large gpu page-table entries are not used, couldn't this be considered
something left to the driver?

And a related question: what about THP pages? Is it OK to set up a single
dma mapping for those?


> 
> > Second, when pre-faulting a range like this, the mmu interval
> > notifier
> > seqno comes into play, until the gpu ptes for the prefault range
> > are
> > safely in place. Now if an invalidation happens in a completely
> > separate part of the mirror range, it will bump the seqno and force
> > us
> > to rerun the fault processing unnecessarily. 
> 
> This is how hmm_range_fault() works. Drivers should not do hacky
> things to try to "improve" this. SVA granuals should be large, maybe
> not the entire MM, but still quite large. 2M is far to small.
> 
> There is a tradeoff here of slowing down the entire MM vs risking an
> iteration during fault processing. We want to err toward making fault
> processing slowing because fault procesing is already really slow.
> 
> > Hence, for this purpose we
> > ideally just want to get a seqno bump covering the prefault range.
> 
> Ideally, but this is not something we can get for free.
> 
> > That's why finer-granularity mmu_interval notifiers might be
> > beneficial
> > (and then cached for future re-use of the same prefault range).
> > This
> > leads me to the next question:
> 
> It is not the design, please don't invent crazy special Intel things
> on top of hmm_range_fault.

For the record, this is not a "crazy special Intel" invention. It's the
way all GPU implementations do this so far. We're currently catching
up. If we're going to do this in another way, we need to fully
understand why it's a bad thing to do. That's why these questions are
asked.

> 
> > You mention that mmu_notifiers are expensive to register. From
> > looking
> > at the code it seems *mmu_interval* notifiers are cheap unless
> > there
> > are ongoing invalidations in which case using a gpu_vma-wide
> > notifier
> > would block anyway? Could you clarify a bit more the cost involved
> > here?
> 
> The rb tree insertions become expensive the larger the tree is. If
> you
> have only a couple of notifiers it is reasonable.
> 
> > If we don't register these smaller-range interval notifiers, do
> > you think the seqno bumps from unrelated subranges would be a real
> > problem?
> 
> I don't think it is, you'd need to have a workload which was
> agressively manipulating the CPU mm (which is also pretty slow). If
> the workload is doing that then it also really won't like being
> slowed
> down by the giant rb tree.

OK, this makes sense, and will also simplify implementation.

> 
> You can't win with an argument that collisions are likely due to an
> app pattern that makes alot of stress on the MM so the right response
> is to make the MM slower.
> 
> > Finally the size of the pre-faulting range is something we need to
> > tune. 
> 
> Correct.
> 
> > Currently it is cpu vma - wide. I understand you strongly suggest
> > this should be avoided. Could you elaborate a bit on why this is
> > such a
> > bad choice?
> 
> Why would a prefetch have anything to do with a VMA? Ie your app
> calls
> malloc() and gets a little allocation out of a giant mmap() arena -
> you want to prefault the entire arena? Does that really make any
> sense?

Personally, no it doesn't. I'd rather use some sort of fixed-size
chunk. But to rephrase, the question was more about the strong "drivers
should not be aware of the cpu mm vma structures" comment. While I
fully agree they are probably not very useful for determining the size
of gpu prefault regions, is there anything else we should be aware of
here?

Thanks,
Thomas


> 
> Mirroring is a huge PITA, IMHO it should be discouraged in favour of
> SVA. Sadly too many CPUs still canot implement SVA.
> 
> With mirroring there is no good way for the system to predict what
> the
> access pattern is. The only way to make this actually performant is
> for userspace to explicitly manage the mirroring with some kind of
> prefetching scheme to avoid faulting its accesses except in
> extrodinary cases.
> 
> VMA is emphatically not a hint about what to prefetch. You should
> balance your prefetching based on HW performance and related. If it
> is
> tidy for HW to fault around a 2M granual then just do that.
> 
> Jason
Jason Gunthorpe April 26, 2024, 4:35 p.m. UTC | #22
On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > First, the gpu_vma structure is something that partitions the
> > > gpu_vm
> > > that holds gpu-related range metadata, like what to mirror, desired
> > > gpu
> > > caching policies etc. These are managed (created, removed and
> > > split)
> > > mainly from user-space. These are stored and looked up from an rb-
> > > tree.
> > 
> > Except we are talking about SVA here, so all of this should not be
> > exposed to userspace.
> 
> I think you are misreading. this is on the level "Mirror this region of
> the cpu_vm", "prefer this region placed in VRAM", "GPU will do atomic
> accesses on this region", very similar to cpu mmap / munmap and
> madvise. What I'm trying to say here is that this does not directly
> affect the SVA except whether to do SVA or not, and in that case what
> region of the CPU mm will be mirrored, and in addition, any gpu
> attributes for the mirrored region.

SVA means you bind the whole MM and device faults dynamically populate
the mirror page table. There aren't non-SVA regions. Metadata, like
you describe, is metadata for the allocation/migration mechanism, not
for the page table, and it has nothing to do with the SVA mirror
operation.

Yes there is another common scheme where you bind a window of CPU to a
window on the device and mirror a fixed range, but this is a quite
different thing. It is not SVA, it has a fixed range, and it is
probably bound to a single GPU VMA in a multi-VMA device page table.

SVA is not just a whole bunch of windows being dynamically created by
the OS, that is entirely the wrong mental model. It would be horrible
to expose to userspace something like that as uAPI. Any hidden SVA
granules and other implementation specific artifacts must not be made
visible to userspace!!

> > If you use dma_map_sg you get into the world of wrongness where you
> > have to track ranges and invalidation has to wipe an entire range -
> > because you cannot do a dma unmap of a single page from a dma_map_sg
> > mapping. This is all the wrong way to use hmm_range_fault.
> > 
> > hmm_range_fault() is page table mirroring, it fundamentally must be
> > page-by-page. The target page table structure must have similar
> > properties to the MM page table - especially page by page
> > validate/invalidate. Meaning you cannot use dma_map_sg().
> 
> To me this is purely an optimization to make the driver page-table and
> hence the GPU TLB benefit from iommu coalescing / large pages and large
> driver PTEs.

This is a different topic. Leon is working on improving the DMA API to
get these kinds of benefits for HMM users. dma_map_sg is not the path
to get this. Leon's work should be significantly better in terms of
optimizing IOVA contiguity for a GPU use case. You can get a
guaranteed DMA contiguity at your chosen granule level, even up to
something like 512M.

> It is true that invalidation will sometimes shoot down
> large gpu ptes unnecessarily but it will not put any additional burden
> on the core AFAICT. 

In my experience people doing performance workloads don't enable the
IOMMU due to the high performance cost, so while optimizing iommu
coalescing is sort of interesting, it is not as important as using the
APIs properly and not harming the much more common situation when
there is no iommu and there is no artificial contiguity.

> on invalidation since zapping the gpu PTEs effectively stops any dma
> accesses. The dma mappings are rebuilt on the next gpu pagefault,
> which, as you mention, are considered slow anyway, but will probably
> still reuse the same prefault region, hence needing to rebuild the dma
> mappings anyway.

This is bad too. The DMA should not remain mapped after pages have
been freed, it completely destroys the concept of IOMMU enforced DMA
security and the ACPI notion of untrusted external devices.
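
For illustration only (the structure, ops hookup and names below are
hypothetical, and locking plus the GPU PTE zap/TLB flush are elided), an
invalidate callback can tear down exactly the invalidated pages because
each page was mapped individually with dma_map_page():

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Hypothetical per-mirror bookkeeping, not taken from the series. */
struct demo_mirror {
	struct mmu_interval_notifier notifier;
	struct device *dev;
	unsigned long start, end;	/* mirrored VA range */
	dma_addr_t *dma_addrs;		/* one entry per PAGE_SIZE page */
};

static bool demo_mirror_invalidate(struct mmu_interval_notifier *mni,
				   const struct mmu_notifier_range *range,
				   unsigned long cur_seq)
{
	struct demo_mirror *m = container_of(mni, struct demo_mirror, notifier);
	unsigned long start = max(range->start, m->start);
	unsigned long end = min(range->end, m->end);
	unsigned long i;

	mmu_interval_set_seq(mni, cur_seq);

	/* demo_zap_gpu_ptes(m, start, end); -- driver specific, elided */

	/* Only the invalidated pages lose their DMA mapping. */
	for (i = (start - m->start) >> PAGE_SHIFT;
	     i < (end - m->start) >> PAGE_SHIFT; i++) {
		if (!m->dma_addrs[i])
			continue;
		dma_unmap_page(m->dev, m->dma_addrs[i], PAGE_SIZE,
			       DMA_BIDIRECTIONAL);
		m->dma_addrs[i] = 0;
	}
	return true;
}

static const struct mmu_interval_notifier_ops demo_mirror_ops = {
	.invalidate = demo_mirror_invalidate,
};

A dma_map_sg() based mapping could not be torn down page by page like
this; the whole sg mapping would have to go at once.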

> So as long as we are correct and do not adversely affect core mm, If
> the gpu performance (for whatever reason) is severely hampered if
> large gpu page-table-entries are not used, couldn't this be considered
> left to the driver?

Please use the APIs properly. We are trying to improve the DMA API to
better support HMM users, and doing unnecessary things like this in
drivers is only harmful to that kind of consolidation.

There is nothing stopping getting large GPU page table entries for
large CPU page table entries.

> And a related question. What about THP pages? OK to set up a single
> dma-mapping to those?

Yes, THP is still a page and dma_map_page() will map it.
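
For illustration only (helper name and the dma_addrs array are made up,
not the series' code), mapping the output of hmm_range_fault() one
PAGE_SIZE entry at a time might look roughly like this:

#include <linux/dma-mapping.h>
#include <linux/hmm.h>
#include <linux/mm.h>

static int demo_map_hmm_pfns(struct device *dev, struct hmm_range *range,
			     dma_addr_t *dma_addrs)
{
	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		struct page *page;

		if (!(range->hmm_pfns[i] & HMM_PFN_VALID))
			continue;	/* hole: leave the GPU PTE unpopulated */

		page = hmm_pfn_to_page(range->hmm_pfns[i]);
		dma_addrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -EFAULT;
	}
	return 0;
}

Each entry can then be dma_unmap_page()'d individually when the
corresponding CPU page is invalidated.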
 
> > > That's why finer-granularity mmu_interval notifiers might be
> > > beneficial
> > > (and then cached for future re-use of the same prefault range).
> > > This
> > > leads me to the next question:
> > 
> > It is not the design, please don't invent crazy special Intel things
> > on top of hmm_range_fault.
> 
> For the record, this is not a "crazy special Intel" invention. It's the
> way all GPU implementations do this so far.

"all GPU implementations" you mean AMD, and AMD predates alot of the
modern versions of this infrastructure IIRC.

> > Why would a prefetch have anything to do with a VMA? Ie your app
> > calls
> > malloc() and gets a little allocation out of a giant mmap() arena -
> > you want to prefault the entire arena? Does that really make any
> > sense?
> 
> Personally, no it doesn't. I'd rather use some sort of fixed-size
> chunk. But to rephrase, the question was more into the strong "drivers
> should not be aware of the cpu mm vma structures" comment. 

But this is essentially why - there is nothing useful the driver can
possibly learn from the CPU VMA to drive
hmm_range_fault(). hmm_range_fault() already has to walk the VMAs; if
someday something is actually needed, it needs to be integrated in a
general way, not by having the driver touch vmas directly.

Jason
Thomas Hellstrom April 29, 2024, 8:25 a.m. UTC | #23
On Fri, 2024-04-26 at 13:35 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 04:49:26PM +0200, Thomas Hellström wrote:
> > On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> > > On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > > > First, the gpu_vma structure is something that partitions the
> > > > gpu_vm
> > > > that holds gpu-related range metadata, like what to mirror,
> > > > desired
> > > > gpu
> > > > caching policies etc. These are managed (created, removed and
> > > > split)
> > > > mainly from user-space. These are stored and looked up from an
> > > > rb-
> > > > tree.
> > > 
> > > Except we are talking about SVA here, so all of this should not
> > > be
> > > exposed to userspace.
> > 
> > I think you are misreading. this is on the level "Mirror this
> > region of
> > the cpu_vm", "prefer this region placed in VRAM", "GPU will do
> > atomic
> > accesses on this region", very similar to cpu mmap / munmap and
> > madvise. What I'm trying to say here is that this does not directly
> > affect the SVA except whether to do SVA or not, and in that case
> > what
> > region of the CPU mm will be mirrored, and in addition, any gpu
> > attributes for the mirrored region.
> 
> SVA is you bind the whole MM and device faults dynamically populate
> the mirror page table. There aren't non-SVA regions. Meta data, like
> you describe, is meta data for the allocation/migration mechanism,
> not
> for the page table or that has anything to do with the SVA mirror
> operation.



> 
> Yes there is another common scheme where you bind a window of CPU to
> a
> window on the device and mirror a fixed range, but this is a quite
> different thing. It is not SVA, it has a fixed range, and it is
> probably bound to a single GPU VMA in a multi-VMA device page table.

And this above here is exactly what we're implementing, and the GPU
page-tables are populated using device faults. Regions (large) of the
mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
buffer objects.

> 
> SVA is not just a whole bunch of windows being dynamically created by
> the OS, that is entirely the wrong mental model. It would be horrible
> to expose to userspace something like that as uAPI. Any hidden SVA
> granules and other implementation specific artifacts must not be made
> visible to userspace!!

Implementation-specific artifacts are not to be made visible to user-
space.

> 
> > > If you use dma_map_sg you get into the world of wrongness where
> > > you
> > > have to track ranges and invalidation has to wipe an entire range
> > > -
> > > because you cannot do a dma unmap of a single page from a
> > > dma_map_sg
> > > mapping. This is all the wrong way to use hmm_range_fault.
> > > 
> > > hmm_range_fault() is page table mirroring, it fundamentally must
> > > be
> > > page-by-page. The target page table structure must have similar
> > > properties to the MM page table - especially page by page
> > > validate/invalidate. Meaning you cannot use dma_map_sg().
> > 
> > To me this is purely an optimization to make the driver page-table
> > and
> > hence the GPU TLB benefit from iommu coalescing / large pages and
> > large
> > driver PTEs.
> 
> This is a different topic. Leon is working on improving the DMA API
> to
> get these kinds of benifits for HMM users. dma_map_sg is not the path
> to get this. Leon's work should be significantly better in terms of
> optimizing IOVA contiguity for a GPU use case. You can get a
> guaranteed DMA contiguity at your choosen granual level, even up to
> something like 512M.
> 
> > It is true that invalidation will sometimes shoot down
> > large gpu ptes unnecessarily but it will not put any additional
> > burden
> > on the core AFAICT. 
> 
> In my experience people doing performance workloads don't enable the
> IOMMU due to the high performance cost, so while optimizing iommu
> coalescing is sort of interesting, it is not as important as using
> the
> APIs properly and not harming the much more common situation when
> there is no iommu and there is no artificial contiguity.
> 
> > on invalidation since zapping the gpu PTEs effectively stops any
> > dma
> > accesses. The dma mappings are rebuilt on the next gpu pagefault,
> > which, as you mention, are considered slow anyway, but will
> > probably
> > still reuse the same prefault region, hence needing to rebuild the
> > dma
> > mappings anyway.
> 
> This is bad too. The DMA should not remain mapped after pages have
> been freed, it completely destroys the concept of IOMMU enforced DMA
> security and the ACPI notion of untrusted external devices.

Hmm. Yes, good point.

> 
> > So as long as we are correct and do not adversely affect core mm,
> > If
> > the gpu performance (for whatever reason) is severely hampered if
> > large gpu page-table-entries are not used, couldn't this be
> > considered
> > left to the driver?
> 
> Please use the APIs properly. We are trying to improve the DMA API to
> better support HMM users, and doing unnecessary things like this in
> drivers is only harmful to that kind of consolidation.
> 
> There is nothing stopping getting large GPU page table entries for
> large CPU page table entries.
> 
> > And a related question. What about THP pages? OK to set up a single
> > dma-mapping to those?
> 
> Yes, THP is still a page and dma_map_page() will map it.

OK great. This is probably sufficient for the performance concern for
now.

Thanks,
Thomas

>  
> > > > That's why finer-granularity mmu_interval notifiers might be
> > > > beneficial
> > > > (and then cached for future re-use of the same prefault range).
> > > > This
> > > > leads me to the next question:
> > > 
> > > It is not the design, please don't invent crazy special Intel
> > > things
> > > on top of hmm_range_fault.
> > 
> > For the record, this is not a "crazy special Intel" invention. It's
> > the
> > way all GPU implementations do this so far.
> 
> "all GPU implementations" you mean AMD, and AMD predates alot of the
> modern versions of this infrastructure IIRC.
> 
> > > Why would a prefetch have anything to do with a VMA? Ie your app
> > > calls
> > > malloc() and gets a little allocation out of a giant mmap() arena
> > > -
> > > you want to prefault the entire arena? Does that really make any
> > > sense?
> > 
> > Personally, no it doesn't. I'd rather use some sort of fixed-size
> > chunk. But to rephrase, the question was more into the strong
> > "drivers
> > should not be aware of the cpu mm vma structures" comment. 
> 
> But this is essentially why - there is nothing useful the driver can
> possibly learn from the CPU VMA to drive
> hmm_range_fault(). hmm_range_fault already has to walk the VMA's if
> someday something is actually needed it needs to be integrated in a
> general way, not by having the driver touch vmas directly.
> 
> Jason
Jason Gunthorpe April 30, 2024, 5:30 p.m. UTC | #24
On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:

> > Yes there is another common scheme where you bind a window of CPU to
> > a
> > window on the device and mirror a fixed range, but this is a quite
> > different thing. It is not SVA, it has a fixed range, and it is
> > probably bound to a single GPU VMA in a multi-VMA device page table.
> 
> And this above here is exactly what we're implementing, and the GPU
> page-tables are populated using device faults. Regions (large) of the
> mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> buffer objects.

Well, not really, if that was the case you'd have a single VMA over
the entire bound range, not dynamically create them.

A single VMA that uses hmm_range_fault() to populate the VM is
completely logical.

Having a hidden range of mm binding and then creating/destroying 2M
VMAs dynamically is the thing that doesn't make a lot of sense.

Jason
Daniel Vetter April 30, 2024, 6:57 p.m. UTC | #25
On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> 
> > > Yes there is another common scheme where you bind a window of CPU to
> > > a
> > > window on the device and mirror a fixed range, but this is a quite
> > > different thing. It is not SVA, it has a fixed range, and it is
> > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > 
> > And this above here is exactly what we're implementing, and the GPU
> > page-tables are populated using device faults. Regions (large) of the
> > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > buffer objects.
> 
> Well, not really, if that was the case you'd have a single VMA over
> the entire bound range, not dynamically create them.
> 
> A single VMA that uses hmm_range_fault() to populate the VM is
> completely logical.
> 
> Having a hidden range of mm binding and then creating/destroying 2M
> VMAs dynamicaly is the thing that doesn't make alot of sense.

I only noticed this thread now but fyi I did dig around in the
implementation and it's summarily an absolute no-go imo for multiple
reasons. It starts with this approach of trying to mirror cpu vma (which I
think originated from amdkfd) leading to all kinds of locking fun, and
then it gets substantially worse when you dig into the details.

I think until something more solid shows up you can just ignore this. I do
fully agree that for sva the main mirroring primitive needs to be page
centric, so dma_map_sg. There's a bit of a question around how to make the
necessary batching efficient and the locking/mmu_interval_notifier scale
enough, but I had some long chats with Thomas and I think there are enough
options to spawn pretty much any possible upstream consensus. So I'm not
worried.

But first this needs to be page-centric in the fundamental mirroring
approach.
-Sima
Jason Gunthorpe May 1, 2024, 12:09 a.m. UTC | #26
On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > 
> > > > Yes there is another common scheme where you bind a window of CPU to
> > > > a
> > > > window on the device and mirror a fixed range, but this is a quite
> > > > different thing. It is not SVA, it has a fixed range, and it is
> > > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > > 
> > > And this above here is exactly what we're implementing, and the GPU
> > > page-tables are populated using device faults. Regions (large) of the
> > > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > > buffer objects.
> > 
> > Well, not really, if that was the case you'd have a single VMA over
> > the entire bound range, not dynamically create them.
> > 
> > A single VMA that uses hmm_range_fault() to populate the VM is
> > completely logical.
> > 
> > Having a hidden range of mm binding and then creating/destroying 2M
> > VMAs dynamicaly is the thing that doesn't make alot of sense.
> 
> I only noticed this thread now but fyi I did dig around in the
> implementation and it's summarily an absolute no-go imo for multiple
> reasons. It starts with this approach of trying to mirror cpu vma (which I
> think originated from amdkfd) leading to all kinds of locking fun, and
> then it gets substantially worse when you dig into the details.

:(

Why does the DRM side struggle so much with hmm_range_fault()? I would
have thought it should have a fairly straightforward and logical
connection to the GPU page table.

FWIW, it does make sense to have both a window and a full MM option
for hmm_range_fault. ODP does both and it is fine..

> I think until something more solid shows up you can just ignore this. I do
> fully agree that for sva the main mirroring primitive needs to be page
> centric, so dma_map_sg. 
              ^^^^^^^^^^

dma_map_page

> There's a bit a question around how to make the
> necessary batching efficient and the locking/mmu_interval_notifier scale
> enough, but I had some long chats with Thomas and I think there's enough
> option to spawn pretty much any possible upstream consensus. So I'm not
> worried.

Sure, the new DMA API will bring some more considerations to this as
well. ODP uses a 512M granule scheme and it seems to be OK. By far the
worst part of all this is the faulting performance. I've yet to hear any
complaints about mmu notifier performance.

> But first this needs to be page-centric in the fundamental mirroring
> approach.

Yes

Jason
Daniel Vetter May 2, 2024, 8:04 a.m. UTC | #27
On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström wrote:
> > > 
> > > > > Yes there is another common scheme where you bind a window of CPU to
> > > > > a
> > > > > window on the device and mirror a fixed range, but this is a quite
> > > > > different thing. It is not SVA, it has a fixed range, and it is
> > > > > probably bound to a single GPU VMA in a multi-VMA device page table.
> > > > 
> > > > And this above here is exactly what we're implementing, and the GPU
> > > > page-tables are populated using device faults. Regions (large) of the
> > > > mirrored CPU mm need to coexist in the same GPU vm as traditional GPU
> > > > buffer objects.
> > > 
> > > Well, not really, if that was the case you'd have a single VMA over
> > > the entire bound range, not dynamically create them.
> > > 
> > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > completely logical.
> > > 
> > > Having a hidden range of mm binding and then creating/destroying 2M
> > > VMAs dynamicaly is the thing that doesn't make alot of sense.
> > 
> > I only noticed this thread now but fyi I did dig around in the
> > implementation and it's summarily an absolute no-go imo for multiple
> > reasons. It starts with this approach of trying to mirror cpu vma (which I
> > think originated from amdkfd) leading to all kinds of locking fun, and
> > then it gets substantially worse when you dig into the details.
> 
> :(
> 
> Why does the DRM side struggle so much with hmm_range fault? I would
> have thought it should have a fairly straightforward and logical
> connection to the GPU page table.

Short summary is that traditionally gpu memory was managed with buffer
objects, and each individual buffer object owns the page tables for its
va range.

For hmm you don't have that buffer object, and you want the pagetables to
be fairly independent (maybe even with their own locking like linux cpu
pagetables do) from any mapping/backing storage. Getting to that world is
a lot of reshuffling, and so thus far all the code went with the quick
hack route of creating ad-hoc ranges that look like buffer objects to the
rest of the driver code. This includes the merged amdkfd hmm code, and if
you dig around in that it results in some really annoying locking
inversions because that middle layer of fake buffer object lookalikes only
gets in the way and results in a fairly fundamental impendendence mismatch
with core linux mm locking.

> FWIW, it does make sense to have both a window and a full MM option
> for hmm_range_fault. ODP does both and it is fine..
> 
> > I think until something more solid shows up you can just ignore this. I do
> > fully agree that for sva the main mirroring primitive needs to be page
> > centric, so dma_map_sg. 
>               ^^^^^^^^^^
> 
> dma_map_page

Oops yes.

> > There's a bit a question around how to make the
> > necessary batching efficient and the locking/mmu_interval_notifier scale
> > enough, but I had some long chats with Thomas and I think there's enough
> > option to spawn pretty much any possible upstream consensus. So I'm not
> > worried.
> 
> Sure, the new DMA API will bring some more considerations to this as
> well. ODP uses a 512M granual scheme and it seems to be OK. By far the
> worst part of all this is the faulting performance. I've yet hear any
> complains about mmu notifier performance..

Yeah I don't expect there to be any need for performance improvements on
the mmu notifier side of things. All the concerns I've heard felt rather
theoretical, or were just fallout of that fake buffer object layer in the
middle.

At worst I guess the gpu pagetables need per-pgtable locking like the cpu
pagetables have, and then maybe keep track of mmu notifier sequence
numbers on a per-pgtable basis, so that invalidates and faults on
different va ranges have no impact on each other. But even that is most
likely way, way down the road.

> > But first this needs to be page-centric in the fundamental mirroring
> > approach.
> 
> Yes

Ok, clarifying consensus on this was the main reason I replied; it felt a
bit like the thread was derailing into details that don't yet matter.

Thanks, Sima
Thomas Hellstrom May 2, 2024, 9:11 a.m. UTC | #28
Hi,

Although I haven't had a chance yet to look through the current code,
here are some comments along the way.


On Thu, 2024-05-02 at 10:04 +0200, Daniel Vetter wrote:
> On Tue, Apr 30, 2024 at 09:09:15PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 30, 2024 at 08:57:48PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 30, 2024 at 02:30:02PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Apr 29, 2024 at 10:25:48AM +0200, Thomas Hellström
> > > > wrote:
> > > > 
> > > > > > Yes there is another common scheme where you bind a window
> > > > > > of CPU to
> > > > > > a
> > > > > > window on the device and mirror a fixed range, but this is
> > > > > > a quite
> > > > > > different thing. It is not SVA, it has a fixed range, and
> > > > > > it is
> > > > > > probably bound to a single GPU VMA in a multi-VMA device
> > > > > > page table.
> > > > > 
> > > > > And this above here is exactly what we're implementing, and
> > > > > the GPU
> > > > > page-tables are populated using device faults. Regions
> > > > > (large) of the
> > > > > mirrored CPU mm need to coexist in the same GPU vm as
> > > > > traditional GPU
> > > > > buffer objects.
> > > > 
> > > > Well, not really, if that was the case you'd have a single VMA
> > > > over
> > > > the entire bound range, not dynamically create them.
> > > > 
> > > > A single VMA that uses hmm_range_fault() to populate the VM is
> > > > completely logical.
> > > > 
> > > > Having a hidden range of mm binding and then
> > > > creating/destroying 2M
> > > > VMAs dynamicaly is the thing that doesn't make alot of sense.
> > > 
> > > I only noticed this thread now but fyi I did dig around in the
> > > implementation and it's summarily an absolute no-go imo for
> > > multiple
> > > reasons. It starts with this approach of trying to mirror cpu vma
> > > (which I
> > > think originated from amdkfd) leading to all kinds of locking
> > > fun, and
> > > then it gets substantially worse when you dig into the details.

It's true the cpu vma lookup is a remnant from amdkfd. The idea here is
to replace that with fixed prefaulting ranges of tunable size. So far,
as you mention, the prefaulting range has been determined by the CPU
vma size. Given previous feedback, this is going to change.
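
For illustration, a minimal sketch of such a fixed-size prefaulting
range (the granule name and size are made-up tunables, not from the
series); how to further restrict it is exactly the question below:

#include <linux/align.h>
#include <linux/sizes.h>

#define DEMO_PREFAULT_GRANULE	SZ_2M	/* hypothetical tunable */

/* Clamp a fixed prefault window around the faulting address instead of
 * deriving it from the CPU VMA extent. */
static void demo_prefault_window(unsigned long fault_addr,
				 unsigned long *start, unsigned long *end)
{
	*start = ALIGN_DOWN(fault_addr, DEMO_PREFAULT_GRANULE);
	*end = *start + DEMO_PREFAULT_GRANULE;
}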

Still, the prefaulting range needs to be restricted to avoid -EFAULT
failures in hmm_range_fault(). That can ofc be done by calling it
without HMM_PFN_REQ_FAULT for the range and interpreting the returned
pfns. There is a performance concern with this approach as compared to
peeking at the CPU vmas directly, since hmm_range_fault() would need to
be called twice. Any guidelines or ideas here?

The second aspect of this is gpu_vma creation / splitting on fault that
the current implementation has. The plan is to get rid of that as well,
in favour of a sparsely populated gpu_vma. The reason for this
implementation was the easy integration with drm_gpuvm.

Still, I don't see any locking problems here, maintaining the lock order

gpu_vm->lock
mmap_lock
dma_resv
reclaim
notifier_lock

throughout the code. What is likely to get somewhat problematic,
though, is VRAM eviction.

/Thomas




> 
> Ok, clarifying consensus on this was the main reason I replied; it felt a
> bit like the thread was derailing into details that don't yet matter.
> 
> Thanks, Sima

/Thomas
Jason Gunthorpe May 2, 2024, 12:46 p.m. UTC | #29
On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:

> It's true the cpu vma lookup is a remnant from amdkfd. The idea here is
> to replace that with fixed prefaulting ranges of tunable size. So far,
> as you mention, the prefaulting range has been determined by the CPU
> vma size. Given previous feedback, this is going to change.

Perhaps limiting prefault to a VMA barrier is a reasonable thing to
do, but the implementation should be pushed into hmm_range_fault and
not open coded in the driver.

> Still the prefaulting range needs to be restricted to avoid -EFAULT
> failures in hmm_range_fault(). That can ofc be done by calling it
> without HMM_PFN_REQ_FAULT for the range and interpret the returned
> pnfs. 

Yes, this is exactly what that feature is for: you mark your prefetch
differently from the fault-critical page(s).

> There is a performance concern of this approach as compared to
> peeking at the CPU vmas directly, since hmm_range_fault() would need to
> be called twice. Any guidelines ideas here?

If there is something wrong with hmm_range_fault() then please fix
it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is per
PFN?
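
A sketch of what that looks like in practice (names and the window
policy are assumptions, not the series' code): a single
hmm_range_fault() call where only the faulting pfn carries
HMM_PFN_REQ_FAULT, wrapped in the usual notifier retry loop:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/string.h>

static int demo_fault_window(struct mmu_interval_notifier *notifier,
			     struct mm_struct *mm, unsigned long win_start,
			     unsigned long win_end, unsigned long fault_addr,
			     unsigned long *pfns)
{
	unsigned long npages = (win_end - win_start) >> PAGE_SHIFT;
	struct hmm_range range = {
		.notifier = notifier,
		.start = win_start,
		.end = win_end,
		.hmm_pfns = pfns,
		.default_flags = 0,			/* prefetch: no forced faults */
		.pfn_flags_mask = HMM_PFN_REQ_FAULT,	/* honour the per-pfn flag */
	};
	int ret;

	while (true) {
		/* Only the page the GPU actually touched must be faulted. */
		memset(pfns, 0, npages * sizeof(*pfns));
		pfns[(fault_addr - win_start) >> PAGE_SHIFT] = HMM_PFN_REQ_FAULT;

		range.notifier_seq = mmu_interval_read_begin(notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret == -EBUSY)
			continue;
		if (ret)
			return ret;
		/*
		 * Real code must take the driver's notifier lock here and
		 * hold it over the retry check and the GPU PTE update.
		 */
		if (!mmu_interval_read_retry(notifier, range.notifier_seq))
			return 0;
	}
}

Prefetch pages that are not populated simply come back without
HMM_PFN_VALID and can be skipped when writing the GPU page table.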

Jason
Thomas Hellstrom May 2, 2024, 3:01 p.m. UTC | #30
On Thu, 2024-05-02 at 09:46 -0300, Jason Gunthorpe wrote:
> On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:
> 
> > It's true the cpu vma lookup is a remnant from amdkfd. The idea
> > here is
> > to replace that with fixed prefaulting ranges of tunable size. So
> > far,
> > as you mention, the prefaulting range has been determined by the
> > CPU
> > vma size. Given previous feedback, this is going to change.
> 
> Perhaps limiting prefault to a VMA barrier is a reasonable thing to
> do, but the implementation should be pushed into hmm_range_fault and
> not open coded in the driver.
> 
> > Still the prefaulting range needs to be restricted to avoid -EFAULT
> > failures in hmm_range_fault(). That can ofc be done by calling it
> > without HMM_PFN_REQ_FAULT for the range and interpret the returned
> > pnfs. 
> 
> Yes, this is exactly what that feature is for, you mark your prefetch
> differently from the fault critical page(s).
> 
> > There is a performance concern of this approach as compared to
> > peeking at the CPU vmas directly, since hmm_range_fault() would
> > need to
> > be called twice. Any guidelines ideas here?
> 
> If there is something wrong with hmm_range_fault() then please fix
> it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is
> per
> PFN?

Ah, yes you're right. I somehow thought it was per range. Makes sense
now.

Thanks,
Thomas



> 
> Jason
Oak Zeng May 2, 2024, 7:25 p.m. UTC | #31
Hi Jason,

I tried to understand how you intended us to use hmm_range_fault... it seems you want us to call hmm_range_fault twice on each gpu page fault:

1.
Call hmm_range_fault a first time, with the pfn of the fault address set to HMM_PFN_REQ_FAULT
Other pfns in the PREFAULT_SIZE range will be set to 0
hmm_range_fault returns:
	pfn with flag 0 or HMM_PFN_VALID means a valid pfn
	pfn with flag HMM_PFN_ERROR means an invalid pfn

2.
Then call hmm_range_fault a second time,
setting the hmm_range start/end to cover only valid pfns,
and with all valid pfns, set the REQ_FAULT flag


Basically, use hmm_range_fault to figure out the valid address range in the first round; then really fault (e.g., trigger a cpu fault to allocate system pages) in the second call to hmm_range_fault.

Do I understand it correctly?

This is strange to me. We should already know the valid address range before we call hmm_range_fault, because the migration code needs to look up the cpu vma anyway. What is the point of the first hmm_range_fault call?

Oak

> -----Original Message-----
> From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Sent: Thursday, May 2, 2024 11:02 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>; Zeng, Oak <oak.zeng@intel.com>; dri-
> devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; Brost,
> Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, 2024-05-02 at 09:46 -0300, Jason Gunthorpe wrote:
> > On Thu, May 02, 2024 at 11:11:04AM +0200, Thomas Hellström wrote:
> >
> > > It's true the cpu vma lookup is a remnant from amdkfd. The idea
> > > here is
> > > to replace that with fixed prefaulting ranges of tunable size. So
> > > far,
> > > as you mention, the prefaulting range has been determined by the
> > > CPU
> > > vma size. Given previous feedback, this is going to change.
> >
> > Perhaps limiting prefault to a VMA barrier is a reasonable thing to
> > do, but the implementation should be pushed into hmm_range_fault and
> > not open coded in the driver.
> >
> > > Still the prefaulting range needs to be restricted to avoid -EFAULT
> > > failures in hmm_range_fault(). That can ofc be done by calling it
> > > without HMM_PFN_REQ_FAULT for the range and interpret the returned
> > > pnfs.
> >
> > Yes, this is exactly what that feature is for, you mark your prefetch
> > differently from the fault critical page(s).
> >
> > > There is a performance concern of this approach as compared to
> > > peeking at the CPU vmas directly, since hmm_range_fault() would
> > > need to
> > > be called twice. Any guidelines ideas here?
> >
> > If there is something wrong with hmm_range_fault() then please fix
> > it. I'm not sure why you'd call it twice, the HMM_PFN_REQ_FAULT is
> > per
> > PFN?
> 
> Ah, yes you're right. I somehow thought it was per range. Makes sense
> now.
> 
> Thanks,
> Thomas
> 
> 
> 
> >
> > Jason
Jason Gunthorpe May 3, 2024, 1:37 p.m. UTC | #32
On Thu, May 02, 2024 at 07:25:50PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I tried to understand how you supposed us to use hmm range fault... it seems you want us to call hmm range fault two times on each gpu page fault:
 
> 1.
> Call Hmm_range_fault first time, pfn of the fault address is set with HMM_PFN_REQ_FAULT
> Other pfns in the PREFAULT_SIZE range will be set as 0
> Hmm_range fault returns:
> 	Pfn with 0 flag or HMM_PFN_VALID flag, means a valid pfn
> 	Pfn with HMM_PFN_ERROR flag, means invalid pfn
> 
> 2.	
> Then call hmm_range_fault a second time
> Setting the hmm_range start/end only to cover valid pfns
> With all valid pfns, set the REQ_FAULT flag

Why would you do this? The first already did the faults you needed and
returned all the easy pfns that don't require faulting.

> Basically use hmm_range_fault to figure out the valid address range
> in the first round; then really fault (e.g., trigger cpu fault to
> allocate system pages) in the second call the hmm range fault.

You don't fault on prefetch. Prefetch is about mirroring already
populated pages, it should not be causing new faults.

> Do I understand it correctly?

No
 
> This is strange to me. We should already know the valid address
> range before we call hmm range fault, because the migration codes
> need to look up cpu vma anyway. what is the point of the first
> hmm_range fault?

I don't really understand why the GPU driver would drive migration off
of faulting. It doesn't make a lot of sense, especially if you are
prefetching CPU pages into the GPU and thus won't get faults for them.

If your plan is to leave the GPU page tables unpopulated and then
migrate on every fault to try to achieve some kind of locality then
you'd want to drive the hmm prefetch on the migration window (so you
don't populate unmigrated pages) and hope for the best.

However, the migration stuff should really not be in the driver
either. That should be core DRM logic to manage that. It is so
convoluted and full of policy that all the drivers should be working
in the same way. 

The GPU fault handler should indicate to some core DRM function that a
GPU memory access occurred and get back a prefetch window to pass into
hmm_range_fault. The driver will mirror what the core code tells it.

Jason
Oak Zeng May 3, 2024, 2:43 p.m. UTC | #33
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 3, 2024 9:38 AM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter
> <daniel@ffwll.ch>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Brost, Matthew <matthew.brost@intel.com>;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Thu, May 02, 2024 at 07:25:50PM +0000, Zeng, Oak wrote:
> > Hi Jason,
> >
> > I tried to understand how you supposed us to use hmm range fault... it
> seems you want us to call hmm range fault two times on each gpu page fault:
> 
> > 1.
> > Call Hmm_range_fault first time, pfn of the fault address is set with
> HMM_PFN_REQ_FAULT
> > Other pfns in the PREFAULT_SIZE range will be set as 0
> > Hmm_range fault returns:
> > 	Pfn with 0 flag or HMM_PFN_VALID flag, means a valid pfn
> > 	Pfn with HMM_PFN_ERROR flag, means invalid pfn
> >
> > 2.
> > Then call hmm_range_fault a second time
> > Setting the hmm_range start/end only to cover valid pfns
> > With all valid pfns, set the REQ_FAULT flag
> 
> Why would you do this? The first already did the faults you needed and
> returned all the easy pfns that don't require faulting.

But we have use cases where we want to fault in pages other than the page which contains the GPU fault address, e.g., a user malloc'ed or mmap'ed 8MiB buffer, with no CPU touching of this buffer before the GPU accesses it. Let's say a GPU access caused a GPU page fault at the 2MiB offset. The first hmm_range_fault call would only fault in the page at the 2MiB offset, because in the first call we only set REQ_FAULT on the pfn at the 2MiB offset.

In such a case, we would go over all the pfns returned from the first hmm_range_fault call to learn which pfns are faultable pages not yet faulted in (pfn flag == 0) and which pfns are not possible to fault in at all (pfn flag == HMM_PFN_ERROR), then call hmm_range_fault again, setting REQ_FAULT on all faultable pages.

> 
> > Basically use hmm_range_fault to figure out the valid address range
> > in the first round; then really fault (e.g., trigger cpu fault to
> > allocate system pages) in the second call the hmm range fault.
> 
> You don't fault on prefetch. Prefetch is about mirroring already
> populated pages, it should not be causing new faults.

Maybe we are just using different wording here. We have this scenario that we call prefetch, or whatever you want to call it:

On a GPU page fault at an address A, we want to map an address range (e.g., 2MiB, or whatever size depending on settings) around address A into the GPU page table. The range around A could have no backing pages when the GPU page fault happens. We want to populate the 2MiB range. We can call it prefetch because most pages in this range have not been accessed by the GPU yet, but we expect the GPU to access them soon.

You mentioned "already populated pages". Who populated those pages then? Is it a CPU access populated them? If CPU access those pages first, it is true pages can be already populated. But it is also a valid use case where GPU access address before CPU so there is no "already populated pages" on GPU page fault. Please let us know what is the picture in your head. We seem picture it completely differently.



> 
> > Do I understand it correctly?
> 
> No
> 
> > This is strange to me. We should already know the valid address
> > range before we call hmm range fault, because the migration codes
> > need to look up cpu vma anyway. what is the point of the first
> > hmm_range fault?
> 
> I don't really understand why the GPU driver would drive migration off
> of faulting. It doesn't make alot of sense, especially if you are
> prefetching CPU pages into the GPU and thus won't get faults for them.
> 

Migration on GPU fault is definitely what we want to do. Unlike the RDMA case, the GPU has its own device memory. The size of the device memory is comparable to the size of CPU system memory, sometimes bigger. We rely significantly on device memory for performance. This is why HMM introduced the MEMORY_DEVICE_PRIVATE memory type.

On a GPU page fault, the driver decides whether we need to migrate pages to device memory depending on a lot of factors, such as user hints, atomic correctness requirements, etc. We could migrate, or we could leave pages in CPU system memory, all tuned for performance.


> If your plan is to leave the GPU page tables unpopulated and then
> migrate on every fault to try to achieve some kind of locality then
> you'd want to drive the hmm prefetch on the migration window (so you
> don't populate unmigrated pages) and hope for the best.


Exactly what we did. We decide the migration window by:

1) look up the CPU VMA which contains the GPU fault address
2) decide a migration window per the migration granularity setting (e.g., 2MiB), inside the CPU VMA. If the CPU VMA is smaller than the migration granularity, the migration window is the whole CPU vma range; otherwise, only part of the VMA range is migrated.

We then prefetch the migration window. If migration happened, it is true that all pages are already populated. But there are use cases where migration is skipped and we want to fault in through hmm_range_fault, see the explanation above.

> 
> However, the migration stuff should really not be in the driver
> either. That should be core DRM logic to manage that. It is so
> convoluted and full of policy that all the drivers should be working
> in the same way.

Completely agreed. Moving the migration infrastructure to DRM is part of our plan. We want to first prove the concept with the xekmd driver, then move helpers and infrastructure to DRM. A driver should then be as simple as implementing a few callback functions for device-specific page table programming and device migration, and calling some DRM common functions during gpu page fault handling.
 

> 
> The GPU fault handler should indicate to some core DRM function that a
> GPU memory access occured and get back a prefetch window to pass into
> hmm_range_fault. The driver will mirror what the core code tells it.

No objections to this approach.

Oak

> 
> Jason
Jason Gunthorpe May 3, 2024, 4:28 p.m. UTC | #34
On Fri, May 03, 2024 at 02:43:19PM +0000, Zeng, Oak wrote:
> > > 2.
> > > Then call hmm_range_fault a second time
> > > Setting the hmm_range start/end only to cover valid pfns
> > > With all valid pfns, set the REQ_FAULT flag
> > 
> > Why would you do this? The first already did the faults you needed and
> > returned all the easy pfns that don't require faulting.
> 
> But we have use case where we want to fault-in pages other than the
> page which contains the GPU fault address, e.g., user malloc'ed or
> mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> access it. Let's say GPU access caused a GPU page fault a 2MiB
> place. The first hmm-range-fault would only fault-in the page at
> 2MiB place, because in the first call we only set REQ_FAULT to the
> pfn at 2MiB place.

Honestly, that doesn't make a lot of sense to me, but if you really
want that you should add some new flag and have hmm_range_fault do
this kind of speculative faulting. I think you will end up
significantly over-faulting.

It also doesn't make sense to do faulting in hmm prefetch if you are
going to do migration to force the fault anyhow.


> > > Basically use hmm_range_fault to figure out the valid address range
> > > in the first round; then really fault (e.g., trigger cpu fault to
> > > allocate system pages) in the second call the hmm range fault.
> > 
> > You don't fault on prefetch. Prefetch is about mirroring already
> > populated pages, it should not be causing new faults.
> 
> Maybe there is different wording here. We have this scenario we call
> it prefetch, or whatever you call it:
>
> GPU page fault at an address A, we want to map an address range
> (e.g., 2MiB, or whatever size depending on setting) around address A
> to GPU page table. The range around A could have no backing pages
> when GPU page fault happens. We want to populate the 2MiB range. We
> can call it prefetch because most of pages in this range is not
> accessed by GPU yet, but we expect GPU to access it soon.

This isn't prefetch, that is prefaulting.
 
> You mentioned "already populated pages". Who populated those pages
> then? Is it a CPU access populated them? If CPU access those pages
> first, it is true pages can be already populated. 

Yes, I would think that is a pretty common case

> But it is also a valid use case where GPU access address before CPU
> so there is no "already populated pages" on GPU page fault. Please
> let us know what is the picture in your head. We seem picture it
> completely differently.

And sure, this could happen too, but I feel like it is an application
issue to not prefault the buffers it knows the GPU is going to
touch.

Again, our experiments have shown that taking the fault path is so
slow that sane applications must explicitly prefault and prefetch as
much as possible to avoid the faults in the first place.

I'm not sure I fully agree there is a real need to aggressively optimize
the faulting path like you are describing when it shouldn't really be
used in a performant application :\

> 2) decide a migration window per migration granularity (e.g., 2MiB)
> settings, inside the CPU VMA. If CPU VMA is smaller than the
> migration granular, migration window is the whole CPU vma range;
> otherwise, partially of the VMA range is migrated.

Seems rather arbitrary to me. You are quite likely to capture some
memory that is CPU memory and cause thrashing. As I said before in
common cases the heap will be large single VMAs, so this kind of
scheme is just going to fault a whole bunch of unrelated malloc
objects over to the GPU.

Not sure how it is really a good idea.

Adaptive locality of memory is still an unsolved problem in Linux,
sadly.

> > However, the migration stuff should really not be in the driver
> > either. That should be core DRM logic to manage that. It is so
> > convoluted and full of policy that all the drivers should be working
> > in the same way.
> 
> Completely agreed. Moving migration infrastructures to DRM is part
> of our plan. We want to first prove of concept with xekmd driver,
> then move helpers, infrastructures to DRM. Driver should be as easy
> as implementation a few callback functions for device specific page
> table programming and device migration, and calling some DRM common
> functions during gpu page fault.

You'd be better to start out this way so people can look at and
understand the core code on its own merits.

Jason
Oak Zeng May 3, 2024, 8:29 p.m. UTC | #35
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 3, 2024 12:28 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>; Daniel Vetter
> <daniel@ffwll.ch>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; Brost, Matthew <matthew.brost@intel.com>;
> Welty, Brian <brian.welty@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>; Bommu, Krishnaiah
> <krishnaiah.bommu@intel.com>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@intel.com>; Leon Romanovsky
> <leon@kernel.org>
> Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table
> from hmm range
> 
> On Fri, May 03, 2024 at 02:43:19PM +0000, Zeng, Oak wrote:
> > > > 2.
> > > > Then call hmm_range_fault a second time
> > > > Setting the hmm_range start/end only to cover valid pfns
> > > > With all valid pfns, set the REQ_FAULT flag
> > >
> > > Why would you do this? The first already did the faults you needed and
> > > returned all the easy pfns that don't require faulting.
> >
> > But we have use case where we want to fault-in pages other than the
> > page which contains the GPU fault address, e.g., user malloc'ed or
> > mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> > access it. Let's say GPU access caused a GPU page fault a 2MiB
> > place. The first hmm-range-fault would only fault-in the page at
> > 2MiB place, because in the first call we only set REQ_FAULT to the
> > pfn at 2MiB place.
> 
> Honestly, that doesn't make alot of sense to me, but if you really
> want that you should add some new flag and have hmm_range_fault do
> this kind of speculative faulting. I think you will end up
> significantly over faulting.

The above two-step hmm_range_fault approach is just my guess at what you were suggesting. Since you don't like the CPU vma lookup, we came up with this two-step hmm_range_fault scheme. The first step has the same functionality as a CPU vma lookup.

I also think this approach doesn't make sense.

In our original approach, we look up the cpu vma before migration, calling hmm_range_fault in a non-speculative way, and there is no over-faulting, because we only call hmm_range_fault within a valid range that we get from the CPU vma.

> 
> It also doesn't make sense to do faulting in hmm prefetch if you are
> going to do migration to force the fault anyhow.

What do you mean by hmm prefetch?

As explained, we call hmm_range_fault in two scenarios:

1) call hmm_range_fault to get the current status of the cpu page table without causing a CPU fault. When the address range has already been accessed by the CPU before the GPU, or when we migrate such a range, we run into this scenario

2) when the CPU never accessed the range and the driver determined there is no need to migrate, we call hmm_range_fault to trigger a cpu fault and allocate system pages for this range.
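
The two scenarios above differ only in the input flags handed to
hmm_range_fault; a minimal sketch (hypothetical helper, not from the
series):

#include <linux/hmm.h>

static void demo_setup_range_flags(struct hmm_range *range, bool populate)
{
	if (populate) {
		/* 2) CPU never touched the range, no migration: fault it in */
		range->default_flags = HMM_PFN_REQ_FAULT;
	} else {
		/* 1) just snapshot the current CPU page table state */
		range->default_flags = 0;
	}
	range->pfn_flags_mask = 0;	/* per-pfn input flags unused here */
}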

> 
> 
> > > > Basically use hmm_range_fault to figure out the valid address range
> > > > in the first round; then really fault (e.g., trigger cpu fault to
> > > > allocate system pages) in the second call the hmm range fault.
> > >
> > > You don't fault on prefetch. Prefetch is about mirroring already
> > > populated pages, it should not be causing new faults.
> >
> > Maybe there is different wording here. We have this scenario we call
> > it prefetch, or whatever you call it:
> >
> > GPU page fault at an address A, we want to map an address range
> > (e.g., 2MiB, or whatever size depending on setting) around address A
> > to GPU page table. The range around A could have no backing pages
> > when GPU page fault happens. We want to populate the 2MiB range. We
> > can call it prefetch because most of pages in this range is not
> > accessed by GPU yet, but we expect GPU to access it soon.
> 
> This isn't prefetch, that is prefaulting.

Sure, prefaulting is a better name. 

We do have another prefetch API which can be called from user space to prefetch before GPU job submission.


> 
> > You mentioned "already populated pages". Who populated those pages
> > then? Is it a CPU access populated them? If CPU access those pages
> > first, it is true pages can be already populated.
> 
> Yes, I would think that is a pretty common case
> 
> > But it is also a valid use case where GPU access address before CPU
> > so there is no "already populated pages" on GPU page fault. Please
> > let us know what is the picture in your head. We seem picture it
> > completely differently.
> 
> And sure, this could happen too, but I feel like it is an application
> issue to be not prefaulting the buffers it knows the GPU is going to
> touch.
> 
> Again, our experiments have shown that taking the fault path is so
> slow that sane applications must explicitly prefault and prefetch as
> much as possible to avoid the faults in the first place.

I agree the fault path has a huge overhead. We all agree.


> 
> I'm not sure I full agree there is a real need to agressively optimize
> the faulting path like you are describing when it shouldn't really be
> used in a performant application :\

As a driver, we need to support all possible scenarios. Our way of using hmm_range_fault is just generalized enough to deal with both situations: when the application is smart enough to prefetch/prefault, hmm_range_fault just gets back the existing pfns; otherwise it falls back to the slow faulting path.

It is not an aggressive optimization. The code is written for the fast path but it also works for the slow path.


> 
> > 2) decide a migration window per migration granularity (e.g., 2MiB)
> > settings, inside the CPU VMA. If CPU VMA is smaller than the
> > migration granular, migration window is the whole CPU vma range;
> > otherwise, partially of the VMA range is migrated.
> 
> Seems rather arbitary to me. You are quite likely to capture some
> memory that is CPU memory and cause thrashing. As I said before in
> common cases the heap will be large single VMAs, so this kind of
> scheme is just going to fault a whole bunch of unrelated malloc
> objects over to the GPU.

I want to listen more here.

Here is my understanding. For a malloc of small size, such as less than one page or a few pages, memory is allocated from the heap.

When the malloc size is much more than one page, glibc's behavior is to mmap it directly from the OS rather than allocating from the heap.

In glibc the threshold is defined by MMAP_THRESHOLD. The default value is 128K: https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html

So the heap ends up as a number of small VMAs, each containing a few pages, normally one page per VMA. In the worst case, a VMA shouldn't be bigger than MMAP_THRESHOLD.

In a reasonable GPU application, people use the GPU for compute, which usually involves large amounts of data, many MiB and sometimes even many GiB.

Now going back to our scheme: I picture that in most applications, the CPU vma lookup ends up with a big vma, MiB or GiB in size.

If we end up with a vma that is only a few pages, we fault in the whole vma. It is true that in this case we fault in unrelated malloc objects. Maybe we can fine-tune here to only fault in one page (which is the minimum fault size) for such a case. Admittedly one page can also hold a bunch of unrelated objects. But overall we think this should not be the common case.

Let me know if this understanding is correct.

Or what would you like to do in such a situation?

> 
> Not sure how it is really a good idea.
> 
> Adaptive locality of memory is still an unsolved problem in Linux,
> sadly.
> 
> > > However, the migration stuff should really not be in the driver
> > > either. That should be core DRM logic to manage that. It is so
> > > convoluted and full of policy that all the drivers should be working
> > > in the same way.
> >
> > Completely agreed. Moving migration infrastructures to DRM is part
> > of our plan. We want to first prove of concept with xekmd driver,
> > then move helpers, infrastructures to DRM. Driver should be as easy
> > as implementation a few callback functions for device specific page
> > table programming and device migration, and calling some DRM common
> > functions during gpu page fault.
> 
> You'd be better to start out this way so people can look at and
> understand the core code on its own merits.

The two-step approach was agreed with the DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)


Oak

> 
> Jason
Dave Airlie May 4, 2024, 1:03 a.m. UTC | #36
> Let me know if this understanding is correct.
>
> Or what would you like to do in such situation?
>
> >
> > Not sure how it is really a good idea.
> >
> > Adaptive locality of memory is still an unsolved problem in Linux,
> > sadly.
> >
> > > > However, the migration stuff should really not be in the driver
> > > > either. That should be core DRM logic to manage that. It is so
> > > > convoluted and full of policy that all the drivers should be working
> > > > in the same way.
> > >
> > > Completely agreed. Moving migration infrastructures to DRM is part
> > > of our plan. We want to first prove of concept with xekmd driver,
> > > then move helpers, infrastructures to DRM. Driver should be as easy
> > > as implementation a few callback functions for device specific page
> > > table programming and device migration, and calling some DRM common
> > > functions during gpu page fault.
> >
> > You'd be better to start out this way so people can look at and
> > understand the core code on its own merits.
>
> The two steps way were agreed with DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)

After this discussion and the other cross-device HMM stuff I think we
should probably push more for common up-front, I think doing this in a
driver without considering the bigger picture might not end up
extractable, and then I fear the developers will just move onto other
things due to management pressure to land features over correctness.

I think we have enough people on the list that can review this stuff,
and even if the common code ends up being a little xe specific,
iterating it will be easier outside the driver, as we can clearly
demark what is inside and outside.

Dave.
Daniel Vetter May 6, 2024, 1:04 p.m. UTC | #37
On Sat, May 04, 2024 at 11:03:03AM +1000, Dave Airlie wrote:
> > Let me know if this understanding is correct.
> >
> > Or what would you like to do in such situation?
> >
> > >
> > > Not sure how it is really a good idea.
> > >
> > > Adaptive locality of memory is still an unsolved problem in Linux,
> > > sadly.
> > >
> > > > > However, the migration stuff should really not be in the driver
> > > > > either. That should be core DRM logic to manage that. It is so
> > > > > convoluted and full of policy that all the drivers should be working
> > > > > in the same way.
> > > >
> > > > Completely agreed. Moving migration infrastructures to DRM is part
> > > > of our plan. We want to first prove of concept with xekmd driver,
> > > > then move helpers, infrastructures to DRM. Driver should be as easy
> > > > as implementation a few callback functions for device specific page
> > > > table programming and device migration, and calling some DRM common
> > > > functions during gpu page fault.
> > >
> > > You'd be better to start out this way so people can look at and
> > > understand the core code on its own merits.
> >
> > The two steps way were agreed with DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)
> 
> After this discussion and the other cross-device HMM stuff I think we
> should probably push more for common up-front, I think doing this in a
> driver without considering the bigger picture might not end up
> extractable, and then I fear the developers will just move onto other
> things due to management pressure to land features over correctness.
> 
> I think we have enough people on the list that can review this stuff,
> and even if the common code ends up being a little xe specific,
> iterating it will be easier outside the driver, as we can clearly
> demark what is inside and outside.

tldr; Yeah concurring.

I think like with the gpu vma stuff we should at least aim for the core
data structures, and more importantly, the locking design and how it
interacts with core mm services to be common code.

I read through amdkfd and I think that one is warning enough that this
area is one of these cases where going with common code aggressively is
much better. Because it will be buggy in terribly "how do we get out of
this design corner again ever?" ways no matter what. But with common code
there will at least be all of dri-devel and hopefully some mm folks
involved in sorting things out.

Most other areas it's indeed better to explore the design space with a few
drivers before going with common code, at the cost of having some really
terrible driver code in upstream. But here the cost of some really bad
design in drivers is just too expensive imo.
-Sima
Jason Gunthorpe May 6, 2024, 1:33 p.m. UTC | #38
On Fri, May 03, 2024 at 08:29:39PM +0000, Zeng, Oak wrote:

> > > But we have use case where we want to fault-in pages other than the
> > > page which contains the GPU fault address, e.g., user malloc'ed or
> > > mmap'ed 8MiB buffer, and no CPU touching of this buffer before GPU
> > > access it. Let's say GPU access caused a GPU page fault a 2MiB
> > > place. The first hmm-range-fault would only fault-in the page at
> > > 2MiB place, because in the first call we only set REQ_FAULT to the
> > > pfn at 2MiB place.
> > 
> > Honestly, that doesn't make alot of sense to me, but if you really
> > want that you should add some new flag and have hmm_range_fault do
> > this kind of speculative faulting. I think you will end up
> > significantly over faulting.
> 
> Above 2 steps hmm-range-fault approach is just my guess of what you
> were suggesting. Since you don't like the CPU vma look up, so we
> come out this 2 steps hmm-range-fault thing. The first step has the
> same functionality of a CPU vma lookup.

If you want to retain the GPU fault flag as a signal for changing
locality then you have to correct the locality and resolve all faults
before calling hmm_range_fault(). hmm_range_fault() will never do
faulting. It will always just read in the already resolved pages.
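
For reference, in that flow the request flags would simply not ask for any faulting (assuming the current hmm_range_fault() interface):

/* Snapshot only: report what is already resolved, do not trigger CPU faults. */
range.default_flags = 0;
range.pfn_flags_mask = 0;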

> > It also doesn't make sense to do faulting in hmm prefetch if you are
> > going to do migration to force the fault anyhow.
> 
> What do you mean by hmm prefetch?

I mean the pages that are not part of the critical fault
resolution. The pages you are preloading into the GPU page table
without an immediate need.
 
> As explained, we call hmm-range-fault in two scenarios:
> 
> 1) call hmm-range-fault to get the current status of cpu page table
> without causing CPU fault. When address range is already accessed by
> CPU before GPU, or when we migrate for such range, we run into this
> scenario

This is because you are trying to keep locality management outside of
the core code - it is creating this problem. As I said below, locality
management should be core code, not in drivers. It may be hmm core
code, not drm, but regardless.
 
> We do have another prefetch API which can be called from user space
> to prefetch before GPU job submission.

This API seems like it would break the use of faulting as a mechanism
to manage locality...

> > I'm not sure I full agree there is a real need to agressively optimize
> > the faulting path like you are describing when it shouldn't really be
> > used in a performant application :\
> 
> As a driver, we need to support all possible scenarios. 

Functionally supporting it is different from micro-optimizing it.

> > > 2) decide a migration window per migration granularity (e.g., 2MiB)
> > > settings, inside the CPU VMA. If CPU VMA is smaller than the
> > > migration granular, migration window is the whole CPU vma range;
> > > otherwise, partially of the VMA range is migrated.
> > 
> > Seems rather arbitary to me. You are quite likely to capture some
> > memory that is CPU memory and cause thrashing. As I said before in
> > common cases the heap will be large single VMAs, so this kind of
> > scheme is just going to fault a whole bunch of unrelated malloc
> > objects over to the GPU.
> 
> I want to listen more here.
> 
> Here is my understanding. Malloc of small size such as less than one
> page, or a few pages, memory is allocated from heap.
> 
> When malloc is much more than one pages, the GlibC's behavior is
> mmap it directly from OS, vs from heap

Yes "much more", there is some cross over where very large allocations
may get there own arena.
 
> In glibC the threshold is defined by MMAP_THRESHOLD. The default
> value is 128K:
> https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html

Sure
 
> So on the heap, it is some small VMAs each contains a few pages,
> normally one page per VMA. In the worst case, VMA on pages shouldn't
> be bigger than MMAP_THRESHOLD.

Huh? That isn't quite how it works. The glibc arenas for < 128K
allocations can be quite big; they will often come from the brk heap,
which is a single large VMA. The above only says that allocations over
128K will get their own VMAs. It doesn't say small allocations get
small VMAs.

Of course there are many allocator libraries with different schemes
and tunables.

> In a reasonable GPU application, people use GPU for compute which
> usually involves large amount of data which can be many MiB,
> sometimes it can even be many GiB of data

Then the application can also prefault the whole thing.

> Now going back our scheme. I picture in most application, the CPU
> vma search end up big vma, MiB, GiB etc

I'm not sure. Some may, but not all, and not all memory touched by the
GPU will necessarily come from the giant allocation even in the apps
that do work that way.

> If we end up with a vma that is only a few pages, we fault in the
> whole vma. It is true that for this case we fault in unrelated
> malloc objects. Maybe we can fine tune here to only fault in one
> page (which is minimum fault size) for such case. Admittedly one
> page can also have bunch of unrelated objects. But overall we think
> this should not be common case.

This is the obvious solution. Without some kind of special knowledge
the kernel possibly shouldn't attempt to optimize by speculating how
to resolve the fault - or minimally the speculation needs to be a
tunable (ugh)

Broadly, I think using fault indication to indicate locality of pages
that haven't been faulted is pretty bad. Locality indications need to
come from some way that reliably indicates if the device is touching
the pages at all.

Arguably this can never be performant, so I'd argue you should focus
on making things simply work (ie single fault, no prefault, basic
prefetch) and do not expect to achieve a high quality dynamic
locality. Application must specify, application must prefault &
prefetch.

Jason
Matthew Brost May 6, 2024, 11:50 p.m. UTC | #39
On Mon, May 06, 2024 at 03:04:15PM +0200, Daniel Vetter wrote:
> On Sat, May 04, 2024 at 11:03:03AM +1000, Dave Airlie wrote:
> > > Let me know if this understanding is correct.
> > >
> > > Or what would you like to do in such situation?
> > >
> > > >
> > > > Not sure how it is really a good idea.
> > > >
> > > > Adaptive locality of memory is still an unsolved problem in Linux,
> > > > sadly.
> > > >
> > > > > > However, the migration stuff should really not be in the driver
> > > > > > either. That should be core DRM logic to manage that. It is so
> > > > > > convoluted and full of policy that all the drivers should be working
> > > > > > in the same way.
> > > > >
> > > > > Completely agreed. Moving migration infrastructures to DRM is part
> > > > > of our plan. We want to first prove of concept with xekmd driver,
> > > > > then move helpers, infrastructures to DRM. Driver should be as easy
> > > > > as implementation a few callback functions for device specific page
> > > > > table programming and device migration, and calling some DRM common
> > > > > functions during gpu page fault.
> > > >
> > > > You'd be better to start out this way so people can look at and
> > > > understand the core code on its own merits.
> > >
> > > The two steps way were agreed with DRM maintainers, see here:  https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@SA1PR11MB6991.namprd11.prod.outlook.com/, bullet 4)
> > 
> > After this discussion and the other cross-device HMM stuff I think we
> > should probably push more for common up-front, I think doing this in a
> > driver without considering the bigger picture might not end up
> > extractable, and then I fear the developers will just move onto other
> > things due to management pressure to land features over correctness.
> > 
> > I think we have enough people on the list that can review this stuff,
> > and even if the common code ends up being a little xe specific,
> > iterating it will be easier outside the driver, as we can clearly
> > demark what is inside and outside.
> 
> tldr; Yeah concurring.
> 
> I think like with the gpu vma stuff we should at least aim for the core
> data structures, and more importantly, the locking design and how it
> interacts with core mm services to be common code.
> 

I believe this is a reasonable request and hopefully, it should end up
being a pretty thin layer. drm_gpusvm? Have some ideas. Let's see what
we come up with.

Matt

> I read through amdkfd and I think that one is warning enough that this
> area is one of these cases where going with common code aggressively is
> much better. Because it will be buggy in terribly "how do we get out of
> this design corner again ever?" ways no matter what. But with common code
> there will at least be all of dri-devel and hopefully some mm folks
> involved in sorting things out.
> 
> Most other areas it's indeed better to explore the design space with a few
> drivers before going with common code, at the cost of having some really
> terrible driver code in upstream. But here the cost of some really bad
> design in drivers is just too expensive imo.
> -Sima
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Jason Gunthorpe May 7, 2024, 11:56 a.m. UTC | #40
On Mon, May 06, 2024 at 11:50:36PM +0000, Matthew Brost wrote:
> > I think like with the gpu vma stuff we should at least aim for the core
> > data structures, and more importantly, the locking design and how it
> > interacts with core mm services to be common code.
> 
> I believe this is a reasonable request and hopefully, it should end up
> being a pretty thin layer. drm_gpusvm? Have some ideas. Let's see what
> we come up with.

It sounds to me like some of the important pieces should not be in DRM
but somewhere in mm.

Jason
diff mbox series

Patch

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 559188471949..ab3cc2121869 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -6,6 +6,8 @@ 
 #include <linux/mutex.h>
 #include <linux/mm_types.h>
 #include "xe_svm.h"
+#include <linux/hmm.h>
+#include <linux/scatterlist.h>
 
 DEFINE_HASHTABLE(xe_svm_table, XE_MAX_SVM_PROCESS);
 
@@ -61,3 +63,53 @@  struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm)
 
 	return NULL;
 }
+
+/**
+ * xe_svm_build_sg() - build a scatter gather table for all the physical pages/pfn
+ * in a hmm_range.
+ *
+ * @range: the hmm range that we build the sg table from. range->hmm_pfns[]
+ * has the pfn numbers of pages that back up this hmm address range.
+ * @st: pointer to the sg table.
+ *
+ * All the contiguous pfns will be collapsed into one entry in
+ * the scatter gather table. This is for the convenience of
+ * later on operations to bind address range to GPU page table.
+ *
+ * This function allocates the storage of the sg table. It is
+ * caller's responsibility to free it calling sg_free_table.
+ *
+ * Returns 0 if successful; -ENOMEM if fails to allocate memory
+ */
+int xe_svm_build_sg(struct hmm_range *range,
+			     struct sg_table *st)
+{
+	struct scatterlist *sg;
+	u64 i, npages;
+
+	sg = NULL;
+	st->nents = 0;
+	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
+
+	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
+		return -ENOMEM;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long addr = range->hmm_pfns[i];
+
+		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
+			sg->length += PAGE_SIZE;
+			sg_dma_len(sg) += PAGE_SIZE;
+			continue;
+		}
+
+		sg =  sg ? sg_next(sg) : st->sgl;
+		sg_dma_address(sg) = addr;
+		sg_dma_len(sg) = PAGE_SIZE;
+		sg->length = PAGE_SIZE;
+		st->nents++;
+	}
+
+	sg_mark_end(sg);
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index 3ed106ecc02b..191bce6425db 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -13,6 +13,8 @@ 
 #include <linux/interval_tree.h>
 #include <linux/hashtable.h>
 #include <linux/types.h>
+#include <linux/hmm.h>
+#include "xe_device_types.h"
 
 struct xe_vm;
 struct mm_struct;
@@ -69,4 +71,5 @@  struct xe_svm *xe_create_svm(struct xe_vm *vm);
 struct xe_svm *xe_lookup_svm_by_mm(struct mm_struct *mm);
 struct xe_svm_range *xe_svm_range_from_addr(struct xe_svm *svm,
 								unsigned long addr);
+int xe_svm_build_sg(struct hmm_range *range, struct sg_table *st);
 #endif