
[v1,03/14] mm: add iomem vma selection for memory migration

Message ID 20210825034828.12927-4-alex.sierra@amd.com (mailing list archive)
State New
Series Add MEMORY_DEVICE_PUBLIC for CPU-accessible coherent device memory

Commit Message

Sierra Guiza, Alejandro (Alex) Aug. 25, 2021, 3:48 a.m. UTC
In this case, this flag is used to migrate pages from device memory back
to system memory. This particular device memory type should be accessible
by the CPU through IOMEM access. Typically, zone device public type
memory falls into this category.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 include/linux/migrate.h | 1 +
 mm/migrate.c            | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig Aug. 25, 2021, 7:40 a.m. UTC | #1
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>  include/linux/migrate.h | 1 +
>  mm/migrate.c            | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> +	MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_write_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))

This makes the MIGRATE_VMA_SELECT_SYSTEM and MIGRATE_VMA_SELECT_IOMEM
behave entirely identically, which is redundant.  I think we need to
distinguish between the different cases here.  I think the right check
would be pfn_valid(), which should be true for system memory, and
false for iomem.

Also shouldn't this be called DEVICE_PUBLIC instead of IOMEM?
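
For illustration (not part of the posted patch), the pfn_valid()-based
distinction suggested above might look roughly like this inside
migrate_vma_collect_pmd(), keeping the series' MIGRATE_VMA_SELECT_IOMEM name:

			pfn = pte_pfn(pte);
			/*
			 * Hypothetical sketch: treat pfn_valid() pfns as
			 * system RAM and anything else as IOMEM, so the
			 * two flags no longer select the same pages.
			 */
			if (pfn_valid(pfn)) {
				if (!(migrate->flags &
				      MIGRATE_VMA_SELECT_SYSTEM))
					goto next;
			} else {
				if (!(migrate->flags &
				      MIGRATE_VMA_SELECT_IOMEM))
					goto next;
			}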
Christoph Hellwig Aug. 25, 2021, 7:46 a.m. UTC | #2
On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>  				goto next;
>  			pfn = pte_pfn(pte);
>  			if (is_zero_pfn(pfn)) {

.. also how is this going to work for the device public memory?  That
should be pte_special() and thus fail vm_normal_page.
Felix Kuehling Aug. 25, 2021, 2:24 p.m. UTC | #3
On 2021-08-24 at 11:48 p.m., Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
>
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> ---
>  include/linux/migrate.h | 1 +
>  mm/migrate.c            | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
>  enum migrate_vma_direction {
>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> +	MIGRATE_VMA_SELECT_IOMEM = 1 << 2,

How about calling this MIGRATE_VMA_SELECT_DEVICE_PUBLIC?


>  };
>  
>  struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			if (is_write_device_private_entry(entry))
>  				mpfn |= MIGRATE_PFN_WRITE;
>  		} else {
> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>  				goto next;

For MIGRATE_VMA_SELECT_IOMEM/DEVICE_PUBLIC, I think we should ensure the
pages are ZONE_DEVICE and we should also check the page owner for
symmetry with DEVICE_PRIVATE.
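
A rough sketch of what that tightened selection could look like
(hypothetical, not part of this series; it reuses the existing pgmap_owner
field of struct migrate_vma):

			page = vm_normal_page(migrate->vma, addr, pte);
			if (page && is_zone_device_page(page)) {
				/*
				 * Hypothetical: coherent device pages are
				 * only selected by their owner, mirroring
				 * the DEVICE_PRIVATE path.
				 */
				if (!(migrate->flags &
				      MIGRATE_VMA_SELECT_IOMEM) ||
				    page->pgmap->owner != migrate->pgmap_owner)
					goto next;
			} else {
				if (!(migrate->flags &
				      MIGRATE_VMA_SELECT_SYSTEM))
					goto next;
			}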

Regards,
  Felix


>  			pfn = pte_pfn(pte);
>  			if (is_zero_pfn(pfn)) {
Sierra Guiza, Alejandro (Alex) Aug. 25, 2021, 6:24 p.m. UTC | #4
On 8/25/2021 2:46 AM, Christoph Hellwig wrote:
> On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>>   		} else {
>> -			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>> +			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
>> +			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>>   				goto next;
>>   			pfn = pte_pfn(pte);
>>   			if (is_zero_pfn(pfn)) {
> .. also how is this going to work for the device public memory?  That
> should be pte_special() an thus fail vm_normal_page.
Perhaps we're missing something, as we're not doing any special marking
for the device public pfns/entries.
pfn_valid() returns true, pte_special() returns false and pfn_t_devmap()
returns false on these pages, same as for system pages.
That's the reason vm_normal_page() returns the page correctly through
the pfn_to_page() helper.

Regards,
Alex S.
Felix Kuehling Aug. 26, 2021, 10:27 p.m. UTC | #5
On 2021-08-25 at 2:24 p.m., Sierra Guiza, Alejandro (Alex) wrote:
>
> On 8/25/2021 2:46 AM, Christoph Hellwig wrote:
>> On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>>>           } else {
>>> -            if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>>> +            if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
>>> +                !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>>>                   goto next;
>>>               pfn = pte_pfn(pte);
>>>               if (is_zero_pfn(pfn)) {
>> .. also how is this going to work for the device public memory?  That
>> should be pte_special() an thus fail vm_normal_page.
> Perhaps we're missing something, as we're not doing any special
> marking for the device public pfn/entries.
> pfn_valid return true, pte_special return false and pfn_t_devmap
> return false on these pages. Same as system pages.
> That's the reason vm_normal_page returns the page correctly through
> pfn_to_page helper.

Hi Christoph,

I think we're missing something here. As far as I can tell, all the work
we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
normal pages. Are we missing something in our driver code that would
make these PTEs special? I don't understand how that can be, because
driver code is not really involved in updating the CPU mappings. Maybe
it's something we need to do in the migration helpers.

Thanks,
  Felix


>
> Regards,
> Alex S.
Christoph Hellwig Aug. 30, 2021, 8:28 a.m. UTC | #6
On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
> I think we're missing something here. As far as I can tell, all the work
> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
> normal pages. Are we missing something in our driver code that would
> make these PTEs special? I don't understand how that can be, because
> driver code is not really involved in updating the CPU mappings. Maybe
> it's something we need to do in the migration helpers.

It looks like I'm totally misunderstanding what you are adding here
then.  Why do we need any special treatment at all for memory that
has normal struct pages and is part of the direct kernel map?
Felix Kuehling Aug. 30, 2021, 5:04 p.m. UTC | #7
On 2021-08-30 at 4:28 a.m., Christoph Hellwig wrote:
> On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
>> I think we're missing something here. As far as I can tell, all the work
>> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
>> normal pages. Are we missing something in our driver code that would
>> make these PTEs special? I don't understand how that can be, because
>> driver code is not really involved in updating the CPU mappings. Maybe
>> it's something we need to do in the migration helpers.
> It looks like I'm totally misunderstanding what you are adding here
> then.  Why do we need any special treatment at all for memory that
> has normal struct pages and is part of the direct kernel map?

The pages are like normal memory for purposes of mapping them in CPU
page tables and for coherent access from the CPU. From an application
perspective, we want file-backed and anonymous mappings to be able to
use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
optimize performance for GPU heavy workloads while minimizing the need
to migrate data back-and-forth between system memory and device memory.

The pages are special in two ways:

 1. The memory is managed not by the Linux buddy allocator, but by the
    GPU driver's TTM memory manager
 2. We want to migrate data in response to GPU page faults and
    application hints using the migrate_vma helpers

It's the second part that we're really trying to address with this patch
series.

Regards,
  Felix
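
As an illustration of point 2 above, here is a hedged sketch of how a driver
might drive such a migration with the existing migrate_vma helpers and the
new selection flag; dummy_alloc_and_copy() is a placeholder and error
handling is trimmed:

static int dummy_migrate_range_to_ram(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end,
				      void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_IOMEM,
	};
	int ret;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	args.src = src;
	args.dst = dst;

	ret = migrate_vma_setup(&args);		/* collect and isolate pages */
	if (!ret && args.cpages) {
		dummy_alloc_and_copy(&args);	/* fill args.dst[], copy data */
		migrate_vma_pages(&args);	/* install the new pages */
		migrate_vma_finalize(&args);	/* unlock and release old pages */
	}

	kfree(src);
	kfree(dst);
	return ret;
}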
Christoph Hellwig Sept. 1, 2021, 8:29 a.m. UTC | #8
On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >> driver code is not really involved in updating the CPU mappings. Maybe
> >> it's something we need to do in the migration helpers.
> > It looks like I'm totally misunderstanding what you are adding here
> > then.  Why do we need any special treatment at all for memory that
> > has normal struct pages and is part of the direct kernel map?
> 
> The pages are like normal memory for purposes of mapping them in CPU
> page tables and for coherent access from the CPU.

That's the user page tables.  What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

> From an application
> perspective, we want file-backed and anonymous mappings to be able to
> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> optimize performance for GPU heavy workloads while minimizing the need
> to migrate data back-and-forth between system memory and device memory.

I don't really understand that part.  File-backed pages are always
allocated by the file system using the pagecache helpers, that is,
using the page allocator.  Anonymous memory also always comes from
the page allocator.

> The pages are special in two ways:
> 
>  1. The memory is managed not by the Linux buddy allocator, but by the
>     GPU driver's TTM memory manager

Why?

>  2. We want to migrate data in response to GPU page faults and
>     application hints using the migrate_vma helpers

Why?
Felix Kuehling Sept. 1, 2021, 3:40 p.m. UTC | #9
On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote:
> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>> it's something we need to do in the migration helpers.
>>> It looks like I'm totally misunderstanding what you are adding here
>>> then.  Why do we need any special treatment at all for memory that
>>> has normal struct pages and is part of the direct kernel map?
>> The pages are like normal memory for purposes of mapping them in CPU
>> page tables and for coherent access from the CPU.
> That's the user page tables.  What about the kernel direct map?
> If there is a normal kernel struct page backing there really should
> be no need for the pgmap.

I'm not sure. The physical address ranges are in the UEFI system address
map as special-purpose memory. Does Linux create the struct pages and
kernel direct map for that without a pgmap call? I didn't see that last
time I went digging through that code.


>
>> From an application
>> perspective, we want file-backed and anonymous mappings to be able to
>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>> optimize performance for GPU heavy workloads while minimizing the need
>> to migrate data back-and-forth between system memory and device memory.
> I don't really understand that part.  file backed pages are always
> allocated by the file system using the pagecache helpers, that is
> using the page allocator.  Anonymouns memory also always comes from
> the page allocator.

I'm coming at this from my experience with DEVICE_PRIVATE. Both
anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
memory by the migrate_vma_* helpers for more efficient access by our
GPU. (*) It's part of the basic premise of HMM as I understand it. I
would expect the same thing to work for DEVICE_PUBLIC memory.

(*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
currently work, but that's something I'm hoping to fix at some point.


>
>> The pages are special in two ways:
>>
>>  1. The memory is managed not by the Linux buddy allocator, but by the
>>     GPU driver's TTM memory manager
> Why?

It's a system architecture decision: based on the access latency of the
memory and the expected use cases, we do not want the GPU driver and the
Linux buddy allocator/VM subsystem competing for the same device
memory.


>
>>  2. We want to migrate data in response to GPU page faults and
>>     application hints using the migrate_vma helpers
> Why? 

Device memory has much higher bandwidth and much lower latency than
regular system memory for the GPU to access. It's essential for enabling
good GPU application performance. Page-based memory migration enables
good performance with more intuitive programming models such as
managed/unified memory in HIP or unified shared memory in OpenMP. We do
this on our discrete GPUs with DEVICE_PRIVATE memory.

I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
the CPU to map the device memory coherently to minimize the need for
migrations when CPU and GPU access the same memory concurrently or
alternatingly. But we're not going as far as putting that memory
entirely under the management of the Linux memory manager and VM
subsystem. Our (and HPE's) system architects decided that this memory is
not suitable to be used like regular NUMA system memory by the Linux
memory manager.

Regards,
  Felix
Dave Chinner Sept. 1, 2021, 10:03 p.m. UTC | #10
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> 
> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >>>> driver code is not really involved in updating the CPU mappings. Maybe
> >>>> it's something we need to do in the migration helpers.
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
> 
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.
> 
> 
> >
> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymouns memory also always comes from
> > the page allocator.
> 
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.
> 
> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how
filesystems are supposed to interact with this device private
memory. This whole "hand filesystem-controlled memory to other
devices" is a minefield that is trivial to get wrong and very
difficult to fix - just look at the historical mess that RDMA
to/from file-backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to
see an actual specification for how this new memory type is going to
interact with filesystems and the page cache so everyone has some
idea of how this is going to work and can point out how it doesn't
work before code that simply doesn't work is pushed out into
production systems and then merged....

Cheers,

Dave.
Felix Kuehling Sept. 1, 2021, 11:07 p.m. UTC | #11
On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>> it's something we need to do in the migration helpers.
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then.  Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables.  What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
>>
>>
>>>>  From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part.  file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator.  Anonymouns memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>
>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>> currently work, but that's something I'm hoping to fix at some point.
> FWIW, I'd love to see the architecture documents that define how
> filesystems are supposed to interact with this device private
> memory. This whole "hand filesystem controlled memory to other
> devices" is a minefield that is trivial to get wrong iand very
> difficult to fix - just look at the historical mess that RDMA
> to/from file backed and/or DAX pages has been.
>
> So, really, from my perspective as a filesystem engineer, I want to
> see an actual specification for how this new memory type is going to
> interact with filesystem and the page cache so everyone has some
> idea of how this is going to work and can point out how it doesn't
> work before code that simply doesn't work is pushed out into
> production systems and then merged....

OK. To be clear, that's not part of this patch series. And I have no 
authority to push anything in this part of the kernel, so you have 
nothing to fear. ;)

FWIW, we already have the ability to map file-backed system memory pages 
into device page tables with HMM and interval notifiers. But we cannot 
currently migrate them to ZONE_DEVICE pages. Beyond that, my 
understanding of how filesystems and page cache work is rather 
superficial at this point. I'll keep your name in mind for when I am 
ready to discuss this in more detail.

Cheers,
   Felix


>
> Cheers,
>
> Dave.
Dave Chinner Sept. 2, 2021, 1:14 a.m. UTC | #12
On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> > > > > > > driver code is not really involved in updating the CPU mappings. Maybe
> > > > > > > it's something we need to do in the migration helpers.
> > > > > > It looks like I'm totally misunderstanding what you are adding here
> > > > > > then.  Why do we need any special treatment at all for memory that
> > > > > > has normal struct pages and is part of the direct kernel map?
> > > > > The pages are like normal memory for purposes of mapping them in CPU
> > > > > page tables and for coherent access from the CPU.
> > > > That's the user page tables.  What about the kernel direct map?
> > > > If there is a normal kernel struct page backing there really should
> > > > be no need for the pgmap.
> > > I'm not sure. The physical address ranges are in the UEFI system address
> > > map as special-purpose memory. Does Linux create the struct pages and
> > > kernel direct map for that without a pgmap call? I didn't see that last
> > > time I went digging through that code.
> > > 
> > > 
> > > > >  From an application
> > > > > perspective, we want file-backed and anonymous mappings to be able to
> > > > > use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> > > > > optimize performance for GPU heavy workloads while minimizing the need
> > > > > to migrate data back-and-forth between system memory and device memory.
> > > > I don't really understand that part.  file backed pages are always
> > > > allocated by the file system using the pagecache helpers, that is
> > > > using the page allocator.  Anonymouns memory also always comes from
> > > > the page allocator.
> > > I'm coming at this from my experience with DEVICE_PRIVATE. Both
> > > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> > > memory by the migrate_vma_* helpers for more efficient access by our
> > > GPU. (*) It's part of the basic premise of HMM as I understand it. I
> > > would expect the same thing to work for DEVICE_PUBLIC memory.
> > > 
> > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> > > currently work, but that's something I'm hoping to fix at some point.
> > FWIW, I'd love to see the architecture documents that define how
> > filesystems are supposed to interact with this device private
> > memory. This whole "hand filesystem controlled memory to other
> > devices" is a minefield that is trivial to get wrong iand very
> > difficult to fix - just look at the historical mess that RDMA
> > to/from file backed and/or DAX pages has been.
> > 
> > So, really, from my perspective as a filesystem engineer, I want to
> > see an actual specification for how this new memory type is going to
> > interact with filesystem and the page cache so everyone has some
> > idea of how this is going to work and can point out how it doesn't
> > work before code that simply doesn't work is pushed out into
> > production systems and then merged....
> 
> OK. To be clear, that's not part of this patch series. And I have no
> authority to push anything in this part of the kernel, so you have nothing
> to fear. ;)

I know this isn't part of the series, but this patchset is laying
the foundation for future work that will impact subsystems that
currently have zero visibility and/or knowledge of these changes.
There must be an overall architectural plan for this functionality,
regardless of the current state of implementation. It's that overall
architectural plan I'm asking about here, because I need to
understand that before I can sanely comment on the page
cache/filesystem aspect of the proposed functionality...

> FWIW, we already have the ability to map file-backed system memory pages
> into device page tables with HMM and interval notifiers. But we cannot
> currently migrate them to ZONE_DEVICE pages.

Sure, but sharing page cache pages allocated and managed by the
filesystem is not what you are talking about. You're talking about
migrating page cache data to completely different memory allocated
by a different memory manager that the filesystems currently have no
knowledge of or any way of interfacing with.

So I'm asking basic, fundamental questions about how these special
device-based pages are going to work.  How are these pages different
to normal pages - does page_lock() still guarantee exclusive access
to the page state across all hardware that can access it? If not,
what provides access serialisation for pages that are allocated in
device memory rather than CPU memory (e.g. for truncate
serialisation)?  Does the hardware that owns these pages raise page
faults on the CPU when those pages are accessed/dirtied? How does
demand paging in and out of device memory work (i.e. mapping files
larger than device memory)?  How does IO to/from storage work - can
the filesystem build normal bios out of these device pages and issue
IO on them?  Are there additional constraints on IO because p2p DMA is
needed to move the data from the storage HBA directly into/out of
the GPU memory?

I can think of lots more complex questions about how filesystems are
supposed to manage remote device memory in the page cache, but these
are just some of the basic things that make file-backed mappings
different to anonymous mappings that I need to understand before I
can make head or tail of what is being proposed here.....

> Beyond that, my understanding
> of how filesystems and page cache work is rather superficial at this point.
> I'll keep your name in mind for when I am ready to discuss this in more
> detail.

If you don't know what the bigger picture is, then who does?
Somebody built the design/architecture you are working towards, and
they had to communicate it to you somehow. I'm asking for that
information to be documented and made available to all the people these
changes might impact, not whether you personally know how it
works....

Cheers,

Dave.
Christoph Hellwig Sept. 2, 2021, 8:18 a.m. UTC | #13
On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then.  Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables.  What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
> 
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.

So doing some googling finds a patch from Dan that claims to hand EFI
special purpose memory to the device dax driver.  But when I try to
follow the version that got merged it looks like it is treated simply as an
MMIO region to be claimed by drivers, which would not get a struct page.

Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part.  file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator.  Anonymouns memory also always comes from
> > the page allocator.
> 
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.

Ok, so you want to migrate to and from them.  Not use DEVICE_PUBLIC
for the actual page cache pages.  That makes a lot more sense.

> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
> the CPU to map the device memory coherently to minimize the need for
> migrations when CPU and GPU access the same memory concurrently or
> alternatingly. But we're not going as far as putting that memory
> entirely under the management of the Linux memory manager and VM
> subsystem. Our (and HPE's) system architects decided that this memory is
> not suitable to be used like regular NUMA system memory by the Linux
> memory manager.

So yes.  It is a Memory Mapped I/O region, which unlike the PCIe BARs
that people typically deal with is fully cache coherent.  I think this
does make more sense as a description.

But to go back to what started this discussion:  If these are memory
mapped I/O, pfn_valid should generally not return true for them.

And as you already pointed out in reply to Alex we need to tighten the
selection criteria one way or another.
Dan Williams Sept. 2, 2021, 6:07 p.m. UTC | #14
On Thu, Sep 2, 2021 at 1:18 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > >>> It looks like I'm totally misunderstanding what you are adding here
> > >>> then.  Why do we need any special treatment at all for memory that
> > >>> has normal struct pages and is part of the direct kernel map?
> > >> The pages are like normal memory for purposes of mapping them in CPU
> > >> page tables and for coherent access from the CPU.
> > > That's the user page tables.  What about the kernel direct map?
> > > If there is a normal kernel struct page backing there really should
> > > be no need for the pgmap.
> >
> > I'm not sure. The physical address ranges are in the UEFI system address
> > map as special-purpose memory. Does Linux create the struct pages and
> > kernel direct map for that without a pgmap call? I didn't see that last
> > time I went digging through that code.
>
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver.  But when I try to
> follow the version that got merged it looks it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

The original implementation of "soft reserve" support depended on the
combination of the EFI special purpose memory type and the ACPI HMAT
to define the device ranges. The requirement for ACPI HMAT was relaxed
later with commit:

5ccac54f3e12 ACPI: HMAT: attach a device for each soft-reserved range

The expectation is that system software policy can then either use the
device interface, assign a portion of the reservation back to the page
allocator, or ignore the reservation altogether. Is this discussion
asking for a way to assign this memory to the GPU driver to manage?
device-dax already knows how to hand off to the page allocator, so it
seems reasonable for it to be able to hand off to another driver.
Felix Kuehling Sept. 9, 2021, 4:02 a.m. UTC | #15
On 2021-09-02 at 4:18 a.m., Christoph Hellwig wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then.  Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables.  What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver.  But when I try to
> follow the version that got merged it looks it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?
>
>>>> From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part.  file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator.  Anonymouns memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
> Ok, so you want to migrate to and from them.  Not use DEVICE_PUBLIC
> for the actual page cache pages.  That maks a lot more sense.
>
>> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
>> the CPU to map the device memory coherently to minimize the need for
>> migrations when CPU and GPU access the same memory concurrently or
>> alternatingly. But we're not going as far as putting that memory
>> entirely under the management of the Linux memory manager and VM
>> subsystem. Our (and HPE's) system architects decided that this memory is
>> not suitable to be used like regular NUMA system memory by the Linux
>> memory manager.
> So yes.  It is a Memory Mapped I/O region, which unlike the PCIe BARs
> that people typically deal with is fully cache coherent.  I think this
> does make more sense as a description.
>
> But to go back to what start this discussion:  If these are memory
> mapped I/O pfn_valid should generally not return true for them.

As I understand it, pfn_valid should be true for any pfn that's part of
the kernel's physical memory map, i.e. is returned by page_to_pfn or
works with pfn_to_page. Both the hmm_range_fault and the migrate_vma_*
APIs use pfns to refer to regular system memory and ZONE_DEVICE pages
(even DEVICE_PRIVATE). Therefore I believe pfn_valid should be true for
ZONE_DEVICE pages as well.

Regards,
  Felix


>
> And as you already pointed out in reply to Alex we need to tighten the
> selection criteria one way or another.
Felix Kuehling Sept. 9, 2021, 4:55 a.m. UTC | #16
On 2021-09-01 at 9:14 p.m., Dave Chinner wrote:
> On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
>> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
>>> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>>>> it's something we need to do in the migration helpers.
>>>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>>>> then.  Why do we need any special treatment at all for memory that
>>>>>>> has normal struct pages and is part of the direct kernel map?
>>>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>>>> page tables and for coherent access from the CPU.
>>>>> That's the user page tables.  What about the kernel direct map?
>>>>> If there is a normal kernel struct page backing there really should
>>>>> be no need for the pgmap.
>>>> I'm not sure. The physical address ranges are in the UEFI system address
>>>> map as special-purpose memory. Does Linux create the struct pages and
>>>> kernel direct map for that without a pgmap call? I didn't see that last
>>>> time I went digging through that code.
>>>>
>>>>
>>>>>>  From an application
>>>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>>>> to migrate data back-and-forth between system memory and device memory.
>>>>> I don't really understand that part.  file backed pages are always
>>>>> allocated by the file system using the pagecache helpers, that is
>>>>> using the page allocator.  Anonymouns memory also always comes from
>>>>> the page allocator.
>>>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>>>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>>>> memory by the migrate_vma_* helpers for more efficient access by our
>>>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>>>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>>>
>>>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>>>> currently work, but that's something I'm hoping to fix at some point.
>>> FWIW, I'd love to see the architecture documents that define how
>>> filesystems are supposed to interact with this device private
>>> memory. This whole "hand filesystem controlled memory to other
>>> devices" is a minefield that is trivial to get wrong iand very
>>> difficult to fix - just look at the historical mess that RDMA
>>> to/from file backed and/or DAX pages has been.
>>>
>>> So, really, from my perspective as a filesystem engineer, I want to
>>> see an actual specification for how this new memory type is going to
>>> interact with filesystem and the page cache so everyone has some
>>> idea of how this is going to work and can point out how it doesn't
>>> work before code that simply doesn't work is pushed out into
>>> production systems and then merged....
>> OK. To be clear, that's not part of this patch series. And I have no
>> authority to push anything in this part of the kernel, so you have nothing
>> to fear. ;)
> I know this isn't part of the series. but this patchset is laying
> the foundation for future work that will impact subsystems that
> currently have zero visibility and/or knowledge of these changes.

I don't think this patchset is the foundation. Jerome Glisse's work on
HMM is; that was merged 4 years ago and is being used by multiple
drivers now, with the AMD GPU driver being a fairly recent addition.


> There must be an overall architectural plan for this functionality,
> regardless of the current state of implementation. It's that overall
> architectural plan I'm asking about here, because I need to
> understand that before I can sanely comment on the page
> cache/filesystem aspect of the proposed functionality...

The overall HMM and ZONE_DEVICE architecture is documented to some
extent in Documentation/vm/hmm.rst, though it may not go into the level
of detail you are looking for.


>
>> FWIW, we already have the ability to map file-backed system memory pages
>> into device page tables with HMM and interval notifiers. But we cannot
>> currently migrate them to ZONE_DEVICE pages.
> Sure, but sharing page cache pages allocated and managed by the
> filesystem is not what you are talking about. You're talking about
> migrating page cache data to completely different memory allocated
> by a different memory manager that the filesystems currently have no
> knowledge of or have any way of interfacing with.

This is not part of the current patch series. It is only my intention to
look into ways to migrate file-backed pages to ZONE_DEVICE memory in the
future.


>
> So I'm asking basic, fundamental questions about how these special
> device based pages are going to work.  How are these pages different
> to normal pages - does page_lock() still guarantee exclusive access
> to the page state across all hardware that can access it?

Yes. The migration API guarantees that pages are locked during the
migration. The driver code doesn't touch the page state itself. It only
uses the migrate_vma_* helpers to deal with that.

This is not new or changed by this patch series.


>  If not,
> what provides access serialisation for pages that are allocated in
> device memory rather than CPU memory (e.g. for truncate
> serialisation)?  Does the hardware that own these pages raise page
> faults on the CPU when those pages are accessed/dirtied?

Yes. This is done by the hmm_range_fault API, which the driver calls in
order to populate its device page tables. It is synchronized with any
mapping changes through mmu_interval_notifiers.

This is not new or changed by this patch series.
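
The pattern is roughly the one documented in Documentation/vm/hmm.rst; a
condensed, hypothetical version (dummy_update_device_ptes() is a
placeholder) looks like this:

static int dummy_populate_range(struct mmu_interval_notifier *mni,
				unsigned long start, unsigned long end,
				unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= mni,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	struct mm_struct *mm = mni->mm;
	int ret;

	if (!mmget_not_zero(mm))
		return -EFAULT;
again:
	range.notifier_seq = mmu_interval_read_begin(mni);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		goto out;
	}

	/*
	 * A real driver takes its device page table lock here, then
	 * retries if an invalidation raced with the walk.
	 */
	if (mmu_interval_read_retry(mni, range.notifier_seq))
		goto again;

	ret = dummy_update_device_ptes(pfns, start, end);
out:
	mmput(mm);
	return ret;
}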


>  How does
> demand paging in and out of device memory work (i.e. mapping files
> larger than device memory).

That depends on how the device driver handles device page faults. The
AMD GPU driver can handle recoverable device page faults and update the
device page table on demand with updated pfns from hmm_range_fault.

This is not new or changed by this patch series.


>   How does IO to/from storage work - can
> the filesystem build normal bios out of these device pages and issue
> IO on them?

DEVICE_PUBLIC pages, introduced by this patch series, are CPU- and
peer-accessible like normal system memory.

DEVICE_PRIVATE pages are not CPU or peer-accessible. Any access to them
would go through the CPU page fault path and cause a
dev_pagemap_ops.migrate_to_ram callback into the AMD GPU driver to unmap
the memory from the GPU and migrate it back to system memory.
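
For context, a hedged sketch (not taken from this series) of the
registration side, showing where the migrate_to_ram callback fits;
dummy_-prefixed names are placeholders:

static vm_fault_t dummy_migrate_to_ram(struct vm_fault *vmf)
{
	/*
	 * Unmap the faulting page from the GPU and migrate it back to
	 * system memory; only needed for CPU-inaccessible memory.
	 */
	return 0;
}

static const struct dev_pagemap_ops dummy_pgmap_ops = {
	.migrate_to_ram = dummy_migrate_to_ram,
};

static void *dummy_register_device_memory(struct device *dev,
					   struct dev_pagemap *pgmap,
					   u64 start, u64 size)
{
	pgmap->type = MEMORY_DEVICE_PRIVATE;	/* or a coherent/public type */
	pgmap->range.start = start;
	pgmap->range.end = start + size - 1;
	pgmap->nr_range = 1;
	pgmap->ops = &dummy_pgmap_ops;
	pgmap->owner = dev;	/* matched against migrate_vma.pgmap_owner */

	return devm_memremap_pages(dev, pgmap);
}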


>   Are the additional constraints on IO because p2p DMA is
> needed to move the data from the storage HBA directly into/out of
> the GPU memory?
>
> I can think of lots more complex questions about how filesystems are
> supposed to manage remote device memory in the page cache, but these
> are just some of the basic things that make file-backed mappings
> different to anonymous mappings that I need to understand before I
> can make head or tail of what is being proposed here.....
>
>> Beyond that, my understanding
>> of how filesystems and page cache work is rather superficial at this point.
>> I'll keep your name in mind for when I am ready to discuss this in more
>> detail.
> If you don't know what the bigger picture is, then who does?
> Somebody built the design/architecture you are working towards, and
> they had to communicate it to you somehow. I'm asking for that
> information to documented and made available to all the people these
> changes might impact, not whether you personally know how it
> works....

This patch series builds on top of existing HMM work with major
contributions from several people on this thread: Jerome Glisse, Jason
Gunthorpe, Christoph Hellwig, Ralph Campbell.

Beyond the reintroduction of DEVICE_PUBLIC memory in this patch series,
I'm not looking to invent a major new design here. Immediate future work
is more about chipping away at a few remaining limitations of the
implementation, with respect to migration of file-backed pages and maybe
transparent huge pages.

Regards,
  Felix


>
> Cheers,
>
> Dave.

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4bb4e519e3f5..6b16f417384f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -156,6 +156,7 @@  static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+	MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate.c b/mm/migrate.c
index e3a10e2a1bb3..d4ae2da99607 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2406,7 +2406,8 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
-			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
+			    !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
 				goto next;
 			pfn = pte_pfn(pte);
 			if (is_zero_pfn(pfn)) {