
[v6,01/10] mm: add zone device coherent type memory support

Message ID 20220201154901.7921-2-alex.sierra@amd.com
State Superseded
Series Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

Commit Message

Sierra Guiza, Alejandro (Alex) Feb. 1, 2022, 3:48 p.m. UTC
Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
---
v4:
- use the same system entry path for coherent device pages at
migrate_vma_insert_page.

- Add coherent device type support for try_to_migrate /
try_to_migrate_one.
---
 include/linux/memremap.h |  8 +++++++
 include/linux/mm.h       | 16 ++++++++++++++
 mm/memcontrol.c          |  6 +++---
 mm/memory-failure.c      |  8 +++++--
 mm/memremap.c            | 14 ++++++++++++-
 mm/migrate.c             | 45 ++++++++++++++++++++--------------------
 mm/rmap.c                |  5 +++--
 7 files changed, 71 insertions(+), 31 deletions(-)

Comments

David Hildenbrand Feb. 11, 2022, 4:15 p.m. UTC | #1
On 01.02.22 16:48, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
> 
> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

So, I'm currently messing with PageAnon() pages and CoW semantics ...
all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
easier but I'm not sure yet if they make my life harder. I hope you can
help me understand some of that stuff.

1) What are expected CoW semantics for DEVICE_COHERENT?

I assume we'll share them just like other PageAnon() pages during fork():
readable, and the first sharer writing to them receives an "ordinary"
!ZONE_DEVICE copy.

So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
that we don't have to go through the loop of restoring a device
exclusive entry?

2) How are these pages freed to clear/invalidate PageAnon() ?

I assume for PageAnon() ZONE_DEVICE pages we'll always go via
free_devmap_managed_page(), correct?


3) FOLL_PIN

While you write "no one should be allowed to pin such memory", patch #2
only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
you might want to be a bit more precise?


... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?


Thanks for any information.
David Hildenbrand Feb. 11, 2022, 4:39 p.m. UTC | #2
On 11.02.22 17:15, David Hildenbrand wrote:
> On 01.02.22 16:48, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> 
> So, I'm currently messing with PageAnon() pages and CoW semantics ...
> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
> easier but I'm not sure yet if they make my life harder. I hope you can
> help me understand some of that stuff.
> 
> 1) What are expected CoW semantics for DEVICE_COHERENT?
> 
> I assume we'll share them just like other PageAnon() pages during fork():
> readable, and the first sharer writing to them receives an "ordinary"
> !ZONE_DEVICE copy.
> 
> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
> that we don't have to go through the loop of restoring a device
> exclusive entry?
> 
> 2) How are these pages freed to clear/invalidate PageAnon() ?
> 
> I assume for PageAnon() ZONE_DEVICE pages we'll always go via
> free_devmap_managed_page(), correct?
> 
> 
> 3) FOLL_PIN
> 
> While you write "no one should be allowed to pin such memory", patch #2
> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
> you might want to be a bit more precise?
> 
> 
> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
> 
> 
> Thanks for any information.
> 

(digging a bit more, I realized that device exclusive pages are not
actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
will be the actual first PageAnon() ZONE_DEVICE pages that can be
present in a page table.)
Jason Gunthorpe Feb. 11, 2022, 4:45 p.m. UTC | #3
On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:

> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages

Currently the only way to get a DEVICE_PRIVATE page out of the page
tables is via hmm_range_fault() and that doesn't manipulate any ref
counts.

Jason
David Hildenbrand Feb. 11, 2022, 4:49 p.m. UTC | #4
On 11.02.22 17:45, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
> 
>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
> 
> Currently the only way to get a DEVICE_PRIVATE page out of the page
> tables is via hmm_range_fault() and that doesn't manipulate any ref
> counts.

Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
essentially just pointers at ordinary PageAnon() pages. So with DEVICE
COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
present in the page tables where GUP could FOLL_PIN them.
Jason Gunthorpe Feb. 11, 2022, 4:56 p.m. UTC | #5
On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
> On 11.02.22 17:45, Jason Gunthorpe wrote:
> > On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
> > 
> >> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
> > 
> > Currently the only way to get a DEVICE_PRIVATE page out of the page
> > tables is via hmm_range_fault() and that doesn't manipulate any ref
> > counts.
> 
> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
> present in the page tables where GUP could FOLL_PIN them.

This is my understanding

Though you probably understand what PageAnon means a lot better than I
do.. I wonder if it really makes sense to talk about that together
with ZONE_DEVICE, which has a lot in common with filesystem-originated
pages too.

I'm not sure what AMD's plan is here: is there an expectation that a GPU
driver will somehow stuff these pages into an existing anonymous
memory VMA, or do they always come from a driver-originated VMA?

Jason
Felix Kuehling Feb. 11, 2022, 5:05 p.m. UTC | #6
Am 2022-02-11 um 11:15 schrieb David Hildenbrand:
> On 01.02.22 16:48, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> So, I'm currently messing with PageAnon() pages and CoW semantics ...
> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
> easier but I'm not sure yet if they make my life harder. I hope you can
> help me understand some of that stuff.
>
> 1) What are expected CoW semantics for DEVICE_COHERENT?
>
> I assume we'll share them just like other PageAnon() pages during fork():
> readable, and the first sharer writing to them receives an "ordinary"
> !ZONE_DEVICE copy.

Yes.


>
> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
> that we don't have to go through the loop of restoring a device
> exclusive entry?

I'm not sure how DEVICE_EXCLUSIVE pages are handled under CoW. As I 
understand it, they're not really in a special memory zone like 
DEVICE_COHERENT. Just a special way of mapping an ordinary page in order 
to allow device-exclusive access for some time. I suspect there may even 
be a possibility that a page can be both DEVICE_EXCLUSIVE and 
DEVICE_COHERENT.

That said, your statement sounds correct. There is no requirement to do 
anything with the new "ordinary" page after copying. What actually 
happens to DEVICE_COHERENT pages on CoW is a bit convoluted:

When the page is marked as CoW, it is marked R/O in the CPU page table. 
This causes an MMU notifier that invalidates the device PTE. The next 
device access in the parent process causes a page fault. If that's a 
write fault (usually is in our current driver), it will trigger CoW, 
which means the parent process now gets a new system memory copy of the 
page, while the child process keeps the DEVICE_COHERENT page. The driver 
could decide to migrate the page back to a new DEVICE_COHERENT allocation.

In practice that means, "fork" basically causes all DEVICE_COHERENT 
memory in the parent process to be migrated to ordinary system memory, 
which is quite disruptive. What we have today results in correct 
behaviour, but the performance is far from ideal.

We could probably mitigate it by making the driver better at mapping 
pages R/O in the device on read faults, at the potential cost of having 
to handle a second (write) fault later.
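
To illustrate the invalidation step, a minimal sketch of the notifier
callback involved (the drv_* names are made up for illustration, not
our actual driver code):

static bool drv_invalidate(struct mmu_interval_notifier *mni,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq)
{
	struct drv_range *r = container_of(mni, struct drv_range, notifier);

	/* fork() write-protecting the CPU PTE ends up here ... */
	mmu_interval_set_seq(mni, cur_seq);
	/* ... and we tear down the device PTEs; the next device access
	 * faults, and a write fault in the parent triggers CoW */
	drv_invalidate_device_ptes(r, range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops drv_notifier_ops = {
	.invalidate = drv_invalidate,
};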


>
> 2) How are these pages freed to clear/invalidate PageAnon() ?
>
> I assume for PageAnon() ZONE_DEVICE pages we'll always go via
> free_devmap_managed_page(), correct?

Yes. The driver depends on the page->pgmap->ops->page_free callback
to free the device memory allocation backing the page.
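
Schematically (again with hypothetical drv_* names):

static void drv_page_free(struct page *page)
{
	/* the page's refcount dropped to zero; release the device
	 * memory allocation backing it */
	drv_free_backing_mem(page->zone_device_data);
}

static const struct dev_pagemap_ops drv_pagemap_ops = {
	.page_free = drv_page_free,
};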


>
>
> 3) FOLL_PIN
>
> While you write "no one should be allowed to pin such memory", patch #2
> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
> you might want to be a bit more precise?

I agree. I think the paragraph was written before we fully fleshed out 
the interaction with GUP, and then forgotten.


>
>
> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages,

Right. Trying to GUP a DEVICE_PRIVATE page causes a page fault that 
migrates the page back to normal system memory (using the 
page->pgmap->ops->migrate_to_ram callback). Then you pin the system 
memory page.
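
For reference, the fault side looks roughly like this (paraphrased and
trimmed from do_swap_page()):

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (is_device_private_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		/* the driver copies the data back to system memory;
		 * the pin then lands on that system memory page */
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	}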


>   but can we
> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?

I assume you mean DEVICE_COHERENT, not DEVICE_EXCLUSIVE? In that case 
the answer is "Yes".

Regards,
   Felix


>
>
> Thanks for any information.
>
Felix Kuehling Feb. 11, 2022, 5:07 p.m. UTC | #7
Am 2022-02-11 um 11:39 schrieb David Hildenbrand:
> On 11.02.22 17:15, David Hildenbrand wrote:
>> On 01.02.22 16:48, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>> So, I'm currently messing with PageAnon() pages and CoW semantics ...
>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
>> easier but I'm not sure yet if they make my life harder. I hope you can
>> help me understand some of that stuff.
>>
>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>
>> I assume we'll share them just like other PageAnon() pages during fork():
>> readable, and the first sharer writing to them receives an "ordinary"
>> !ZONE_DEVICE copy.
>>
>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>> that we don't have to go through the loop of restoring a device
>> exclusive entry?
>>
>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>
>> I assume for PageAnon() ZONE_DEVICE pages we'll always go via
>> free_devmap_managed_page(), correct?
>>
>>
>> 3) FOLL_PIN
>>
>> While you write "no one should be allowed to pin such memory", patch #2
>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>> you might want to be a bit more precise?
>>
>>
>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
>> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>>
>>
>> Thanks for any information.
>>
> (digging a bit more, I realized that device exclusive pages are not
> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
> will be the actual first PageAnon() ZONE_DEVICE pages that can be
> present in a page table.)

I think DEVICE_GENERIC pages can also be mapped in the page table. In 
fact, the first version of our patches attempted to add migration 
support to DEVICE_GENERIC. But we were convinced to create a new 
ZONE_DEVICE page type for our use case instead.

Regards,
   Felix


>
Alistair Popple Feb. 14, 2022, 2:04 a.m. UTC | #8
Felix Kuehling <felix.kuehling@amd.com> writes:

> Am 2022-02-11 um 11:15 schrieb David Hildenbrand:
>> On 01.02.22 16:48, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>> So, I’m currently messing with PageAnon() pages and CoW semantics …
>> all these PageAnon() ZONE_DEVICE variants don’t necessarily make my life
>> easier but I’m not sure yet if they make my life harder. I hope you can
>> help me understand some of that stuff.
>>
>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>
>> I assume we’ll share them just like other PageAnon() pages during fork():
>> readable, and the first sharer writing to them receives an “ordinary”
>> !ZONE_DEVICE copy.
>
> Yes.
>
>
>>
>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>> that we don’t have to go through the loop of restoring a device
>> exclusive entry?
>
> I’m not sure how DEVICE_EXCLUSIVE pages are handled under CoW. As I understand
> it, they’re not really in a special memory zone like DEVICE_COHERENT. Just a
> special way of mapping an ordinary page in order to allow device-exclusive
> access for some time. I suspect there may even be a possibility that a page can
> be both DEVICE_EXCLUSIVE and DEVICE_COHERENT.

Right - there aren’t really device exclusive pages, they are just special
non-present ptes conceptually pretty similar to migration entries. The
difference is that on CPU fault (or fork) the original entry is restored
immediately after notifying the device that it no longer has exclusive access.

As device exclusive entries can be turned into normal entries whenever required
we handle CoW by restoring the original ptes if a device exclusive entry is
encountered. This reduces the chances of introducing any subtle CoW bugs as it
just gets handled the same as any normal page table entry (because the exclusive
entries will have been removed).
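
For reference, fork's copy path does roughly the following (paraphrased
and trimmed from copy_nonpresent_pte() in mm/memory.c):

	} else if (is_device_exclusive_entry(entry)) {
		/* make the pte present again by restoring the original
		 * entry, then redo the copy as for a normal present pte */
		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
			return -EBUSY;
		return -ENOENT;
	}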

> That said, your statement sounds correct. There is no requirement to do anything
> with the new “ordinary” page after copying. What actually happens to
> DEVICE_COHERENT pages on CoW is a bit convoluted:
>
> When the page is marked as CoW, it is marked R/O in the CPU page table. This
> causes an MMU notifier that invalidates the device PTE. The next device access
> in the parent process causes a page fault. If that’s a write fault (usually is
> in our current driver), it will trigger CoW, which means the parent process now
> gets a new system memory copy of the page, while the child process keeps the
> DEVICE_COHERENT page. The driver could decide to migrate the page back to a new
> DEVICE_COHERENT allocation.
>
> In practice that means, “fork” basically causes all DEVICE_COHERENT memory in
> the parent process to be migrated to ordinary system memory, which is quite
> disruptive. What we have today results in correct behaviour, but the performance
> is far from ideal.
>
> We could probably mitigate it by making the driver better at mapping pages R/O
> in the device on read faults, at the potential cost of having to handle a second
> (write) fault later.
>
>
>>
>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>
>> I assume for PageAnon() ZONE_DEVICE pages we’ll always go via
>> free_devmap_managed_page(), correct?
>
> Yes. The driver depends on the page->pgmap->ops->page_free callback to free
> the device memory allocation backing the page.
>
>
>>
>>
>> 3) FOLL_PIN
>>
>> While you write “no one should be allowed to pin such memory”, patch #2
>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>> you might want to be a bit more precise?
>
> I agree. I think the paragraph was written before we fully fleshed out the
> interaction with GUP, and then forgotten.
>
>
>>
>>
>> … I’m pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages,
>
> Right. Trying to GUP a DEVICE_PRIVATE page causes a page fault that migrates the
> page back to normal system memory (using the page->pgmap->ops->migrate_to_ram
> callback). Then you pin the system memory page.
>
>
>>   but can we
>> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?

In the case of device exclusive entries GUP/PUP will fault and restore the
original entry. It will then pin the original normal page pointed to by the
device exclusive entry.

- Alistair

>
> I assume you mean DEVICE_COHERENT, not DEVICE_EXCLUSIVE? In that case the answer
> is “Yes”.
>
> Regards,
>   Felix
>
>
>>
>>
>> Thanks for any information.
>>
David Hildenbrand Feb. 15, 2022, 12:15 p.m. UTC | #9
On 11.02.22 17:56, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>
>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>
>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>> counts.
>>
>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>> present in the page tables where GUP could FOLL_PIN them.
> 
> This is my understanding
> 
> Though you probably understand what PageAnon means a lot better than I
> do.. I wonder if it really makes sense to talk about that together
> with ZONE_DEVICE, which has a lot in common with filesystem-originated
> pages too.

For me, PageAnon() means that modifications are visible only to the
modifying process. On actual CoW, the underlying page will get replaced
-- in the world of DEVICE_COHERENT that would mean that once you write
to a DEVICE_COHERENT page you could suddenly have a !DEVICE_COHERENT page.

PageAnon() pages don't have a mapping, thus they can only be found in
MAP_ANON VMAs or in MAP_PRIVATE file-backed VMAs. They can only be
found via a page table, and not looked up via the page cache (excluding
the swap cache).

So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
the page tables and the allocation/freeing path differ -- I guess :)

... and as we want pinning semantics to be different we have to touch GUP.

> 
> I'm not sure what AMD's plan is here: is there an expectation that a GPU
> driver will somehow stuff these pages into an existing anonymous
> memory VMA, or do they always come from a driver-originated VMA?

My understanding is that a driver can just decide to replace "ordinary"
PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD
can clarify.
David Hildenbrand Feb. 15, 2022, 12:16 p.m. UTC | #10
On 11.02.22 18:07, Felix Kuehling wrote:
> 
> Am 2022-02-11 um 11:39 schrieb David Hildenbrand:
>> On 11.02.22 17:15, David Hildenbrand wrote:
>>> On 01.02.22 16:48, Alex Sierra wrote:
>>>> Device memory that is cache coherent from device and CPU point of view.
>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>> no one should be allowed to pin such memory so that it can always be
>>>> evicted.
>>>>
>>>> Signed-off-by: Alex Sierra <alex.sierra@amd.com>
>>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>>> Reviewed-by: Alistair Popple <apopple@nvidia.com>
>>> So, I'm currently messing with PageAnon() pages and CoW semantics ...
>>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
>>> easier but I'm not sure yet if they make my life harder. I hope you can
>>> help me understand some of that stuff.
>>>
>>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>>
>>> I assume we'll share them just like other PageAnon() pages during fork():
>>> readable, and the first sharer writing to them receives an "ordinary"
>>> !ZONE_DEVICE copy.
>>>
>>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>>> that we don't have to go through the loop of restoring a device
>>> exclusive entry?
>>>
>>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>>
>>> I assume for PageAnon() ZONE_DEVICE pages we'll always go via
>>> free_devmap_managed_page(), correct?
>>>
>>>
>>> 3) FOLL_PIN
>>>
>>> While you write "no one should be allowed to pin such memory", patch #2
>>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>>> you might want to be a bit more precise?
>>>
>>>
>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
>>> FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>>>
>>>
>>> Thanks for any information.
>>>
>> (digging a bit more, I realized that device exclusive pages are not
>> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
>> will be the actual first PageAnon() ZONE_DEVICE pages that can be
>> present in a page table.)
> 
> I think DEVICE_GENERIC pages can also be mapped in the page table. In 
> fact, the first version of our patches attempted to add migration 
> support to DEVICE_GENERIC. But we were convinced to create a new 
> ZONE_DEVICE page type for our use case instead.

Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
assumption was that they would be part of a special mapping.
Jason Gunthorpe Feb. 15, 2022, 2:45 p.m. UTC | #11
On Tue, Feb 15, 2022 at 01:16:43PM +0100, David Hildenbrand wrote:
> > fact, the first version of our patches attempted to add migration 
> > support to DEVICE_GENERIC. But we were convinced to create a new 
> > ZONE_DEVICE page type for our use case instead.
> 
> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> assumption was that they would be part of a special mapping.

We need to stop using the special PTEs and VMAs for things that have a
struct page. This is a mistake DAX created that must be undone.

Jason
Christoph Hellwig Feb. 15, 2022, 6:32 p.m. UTC | #12
On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > assumption was that they would be part of a special mapping.
> 
> We need to stop using the special PTEs and VMAs for things that have a
> struct page. This is a mistake DAX created that must be undone.

Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
ptes first given that DAX is more complicated.
Felix Kuehling Feb. 15, 2022, 6:52 p.m. UTC | #13
On 2022-02-15 07:15, David Hildenbrand wrote:
> On 11.02.22 17:56, Jason Gunthorpe wrote:
>> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>>
>>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>>> counts.
>>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>>> present in the page tables where GUP could FOLL_PIN them.
>> This is my understanding
>>
>> Though you probably understand what PageAnon means a lot better than I
>> do.. I wonder if it really makes sense to talk about that together
>> with ZONE_DEVICE, which has a lot in common with filesystem-originated
>> pages too.
> For me, PageAnon() means that modifications are visible only to the
> modifying process. On actual CoW, the underlying page will get replaced
> -- in the world of DEVICE_COHERENT that would mean that once you write
> to a DEVICE_COHERENT page you could suddenly have a !DEVICE_COHERENT page.
>
> PageAnon() pages don't have a mapping, thus they can only be found in
> MAP_ANON VMAs or in MAP_PRIVATE file-backed VMAs. They can only be
> found via a page table, and not looked up via the page cache (excluding
> the swap cache).
>
> So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
> exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
> the page tables and the allocation/freeing path differ -- I guess :)
>
> ... and as we want pinning semantics to be different we have to touch GUP.
>
>> I'm not sure what AMD's plan is here: is there an expectation that a GPU
>> driver will somehow stuff these pages into an existing anonymous
>> memory VMA, or do they always come from a driver-originated VMA?
> My understanding is that a driver can just decide to replace "ordinary"
> PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD
> can clarify.

Yes. DEVICE_COHERENT pages are very similar to DEVICE_PRIVATE. They are 
in normal anonymous VMAs (e.g. the application called mmap(..., 
MAP_ANONYMOUS, ...)). You get DEVICE_PRIVATE/COHERENT pages as a result 
of the driver migrating normal anonymous pages into device memory. The 
main difference is that DEVICE_PRIVATE pages aren't host-accessible,
while DEVICE_COHERENT pages are host-accessible and can be mapped in the 
CPU page table or DMA-mapped by peer devices. That's why GUP needs to 
deal with them.
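
Roughly, the migration step looks like this (a sketch with hypothetical
drv_* helpers and error handling trimmed, not our exact driver code):

static int drv_migrate_to_device(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 unsigned long *src, unsigned long *dst)
{
	struct migrate_vma migrate = {
		.vma   = vma,
		.start = start,
		.end   = end,
		.src   = src,
		.dst   = dst,
		.flags = MIGRATE_VMA_SELECT_SYSTEM,
	};
	int r;

	r = migrate_vma_setup(&migrate);   /* unmap and collect source pages */
	if (r || !migrate.cpages)
		return r;

	/* allocate device pages, copy the data, and fill migrate.dst
	 * with migrate_pfn() entries for the new pages */
	drv_copy_to_device_pages(&migrate);

	migrate_vma_pages(&migrate);       /* install the device pages */
	migrate_vma_finalize(&migrate);    /* drop refs on the old pages */
	return 0;
}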

Regards,
   Felix


>
>
Jason Gunthorpe Feb. 15, 2022, 7:41 p.m. UTC | #14
On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > > assumption was that they would be part of a special mapping.
> > 
> > We need to stop using the special PTEs and VMAs for things that have a
> > struct page. This is a mistake DAX created that must be undone.
> 
> Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
> ptes first given that DAX is more complicated.

Probably, I think we can check the page->pgmap type to tell the
difference.

I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
made safe by using the unmap_mapping_range(), which won't work
here. Is there some other trick being used to keep track of references
inside the AMD driver?

Jason
Felix Kuehling Feb. 15, 2022, 9:35 p.m. UTC | #15
On 2022-02-15 14:41, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>>>> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>>>> assumption was that they would be part of a special mapping.
>>> We need to stop using the special PTEs and VMAs for things that have a
>>> struct page. This is a mistake DAX created that must be undone.
>> Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
>> ptes first given that DAX is more complicated.
> Probably, I think we can check the page->pgmap type to tell the
> difference.
>
> I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
> made safe by using the unmap_mapping_range(), which won't work
> here. Is there some other trick being used to keep track of references
> inside the AMD driver?

Not sure I'm following all the discussion about VMAs and DAX. So I may 
be answering the wrong question: We treat each ZONE_DEVICE page as a 
reference to the BO (buffer object) that backs the page. We increment 
the BO refcount for each page we migrate into it. In the 
dev_pagemap_ops.page_free callback we drop that reference. Once all 
pages backed by a BO are freed, the BO refcount reaches 0 [*] and we can 
free the BO allocation.

Regards,
   Felix


[*] That's a somewhat simplified view. There may be other references to 
the BO, which allows us to reuse the same BO for the same virtual 
address range.
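
In simplified form (hypothetical bo_get()/bo_put() helpers), the pairing
looks like:

	/* migrating a page into device memory backed by the BO */
	bo_get(bo);                         /* one reference per page */
	page->zone_device_data = bo;

	/* later, in the dev_pagemap_ops.page_free callback */
	bo_put(page->zone_device_data);     /* last put frees the BO */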

>
> Jason
Jason Gunthorpe Feb. 15, 2022, 9:47 p.m. UTC | #16
On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
> 
> On 2022-02-15 14:41, Jason Gunthorpe wrote:
> > On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
> > > On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > > > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > > > > assumption was that they would be part of a special mapping.
> > > > We need to stop using the special PTEs and VMAs for things that have a
> > > > struct page. This is a mistake DAX created that must be undone.
> > > Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
> > > ptes first given that DAX is more complicated.
> > Probably, I think we can check the page->pgmap type to tell the
> > difference.
> > 
> > I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
> > made safe by using the unmap_mapping_range(), which won't work
> > here. Is there some other trick being used to keep track of references
> > inside the AMD driver?
> 
> Not sure I'm following all the discussion about VMAs and DAX. So I may be
> answering the wrong question: We treat each ZONE_DEVICE page as a reference
> to the BO (buffer object) that backs the page. We increment the BO refcount
> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
> we drop that reference. Once all pages backed by a BO are freed, the BO
> refcount reaches 0 [*] and we can free the BO allocation.

Userspace does
 1) mmap(MAP_PRIVATE) to allocate anon memory
 2) something to trigger migration to install a ZONE_DEVICE page
 3) munmap()

Who decrements the refcount on the munmap?

When a ZONE_DEVICE page is installed in the PTE, it is supposed to be
marked as pte_devmap and that disables all the normal page refcounting
during munmap().

fsdax makes this work by working the refcounts backwards, the page is
refcounted while it exists in the driver, when the driver decides to
remove it then unmap_mapping_range() is called to purge it from all
PTEs and then the refcount is decremented. munmap/fork/etc don't change the
refcount.

Jason
Felix Kuehling Feb. 15, 2022, 10:49 p.m. UTC | #17
On 2022-02-15 16:47, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>> On 2022-02-15 14:41, Jason Gunthorpe wrote:
>>> On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>>>> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>>>>>> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>>>>>> assumption was that they would be part of a special mapping.
>>>>> We need to stop using the special PTEs and VMAs for things that have a
>>>>> struct page. This is a mistake DAX created that must be undone.
>>>> Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
>>>> ptes first given that DAX is more complicated.
>>> Probably, I think we can check the page->pgmap type to tell the
>>> difference.
>>>
>>> I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
>>> made safe by using the unmap_mapping_range(), which won't work
>>> here. Is there some other trick being used to keep track of references
>>> inside the AMD driver?
>> Not sure I'm following all the discussion about VMAs and DAX. So I may be
>> answering the wrong question: We treat each ZONE_DEVICE page as a reference
>> to the BO (buffer object) that backs the page. We increment the BO refcount
>> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
>> we drop that reference. Once all pages backed by a BO are freed, the BO
>> refcount reaches 0 [*] and we can free the BO allocation.
> Userspace does
>   1) mmap(MAP_PRIVATE) to allocate anon memory
>   2) something to trigger migration to install a ZONE_DEVICE page
>   3) munmap()
>
> Who decrements the refcount on the munmap?
>
> When a ZONE_DEVICE page is installed in the PTE, it is supposed to be
> marked as pte_devmap and that disables all the normal page refcounting
> during munmap().
>
> fsdax makes this work by working the refcounts backwards, the page is
> refcounted while it exists in the driver, when the driver decides to
> remove it then unmap_mapping_range() is called to purge it from all
> PTEs and then the refcount is decremented. munmap/fork/etc don't change the
> refcount.

Hmm, that just means, whether or not there are PTEs doesn't really 
matter. It should work the same as it does for DEVICE_PRIVATE pages. I'm 
not sure where DEVICE_PRIVATE page's refcounts are decremented on unmap, 
TBH. But I can't find it in our driver, or in the test_hmm driver for 
that matter.

Regards,
   Felix

>
> Jason
Alistair Popple Feb. 16, 2022, 1:23 a.m. UTC | #18
Jason Gunthorpe <jgg@nvidia.com> writes:

> On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>>
>> On 2022-02-15 14:41, Jason Gunthorpe wrote:
>> > On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>> > > On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>> > > > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>> > > > > assumption was that they would be part of a special mapping.
>> > > > We need to stop using the special PTEs and VMAs for things that have a
>> > > > struct page. This is a mistake DAX created that must be undone.
>> > > Yes, we'll get to it.  Maybe we can do it for the non-DAX devmap
>> > > ptes first given that DAX is more complicated.
>> > Probably, I think we can check the page->pgmap type to tell the
>> > difference.
>> >
>> > I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
>> > made safe by using the unmap_mapping_range(), which won't work
>> > here. Is there some other trick being used to keep track of references
>> > inside the AMD driver?
>>
>> Not sure I'm following all the discussion about VMAs and DAX. So I may be
>> answering the wrong question: We treat each ZONE_DEVICE page as a reference
>> to the BO (buffer object) that backs the page. We increment the BO refcount
>> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
>> we drop that reference. Once all pages backed by a BO are freed, the BO
>> refcount reaches 0 [*] and we can free the BO allocation.
>
> Userspace does
>  1) mmap(MAP_PRIVATE) to allocate anon memory
>  2) something to trigger migration to install a ZONE_DEVICE page
>  3) munmap()
>
> Who decrements the refcount on the munmap?
>
> When a ZONE_DEVICE page is installed in the PTE, it is supposed to be
> marked as pte_devmap and that disables all the normal page refcounting
> during munmap().

Device private and device coherent pages are not marked with pte_devmap and they
are backed by a struct page. The only way of inserting them is via migrate_vma.
The refcount is decremented in zap_pte_range() on munmap() with special handling
for device private pages. Looking at it again though I wonder if there is any
special treatment required in zap_pte_range() for device coherent pages given
they count as present pages.

> fsdax makes this work by working the refcounts backwards, the page is
> refcounted while it exists in the driver, when the driver decides to
> remove it then unmap_mapping_range() is called to purge it from all
> PTEs and then the refcount is decremented. munmap/fork/etc don't change the
> refcount.

The equivalent here is for drivers to use migrate_vma to migrate the pages back
from device memory to CPU memory. In this case the refcounting is (mostly)
handled by migration code which decrements the refcount on the original source
device page during the migration.

- Alistair

> Jason
Jason Gunthorpe Feb. 16, 2022, 2:01 a.m. UTC | #19
On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:

> > Userspace does
> >   1) mmap(MAP_PRIVATE) to allocate anon memory
> >   2) something to trigger migration to install a ZONE_DEVICE page
> >   3) munmap()
> > 
> > Who decrements the refcount on the munmap?
> > 
> > When a ZONE_DEVICE page is installed in the PTE, it is supposed to be
> > marked as pte_devmap and that disables all the normal page refcounting
> > during munmap().
> > 
> > fsdax makes this work by working the refcounts backwards, the page is
> > refcounted while it exists in the driver, when the driver decides to
> > remove it then unmap_mapping_range() is called to purge it from all
> > PTEs and then the refcount is decremented. munmap/fork/etc don't change the
> > refcount.
>
> Hmm, that just means, whether or not there are PTEs doesn't really
> matter.

Yes, that is the FSDAX model

> It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure
> where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I
> can't find it in our driver, or in the test_hmm driver for that matter.

It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap
entries. The put_page for that case is here:

static unsigned long zap_pte_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
[..]
		if (is_device_private_entry(entry) ||
		    is_device_exclusive_entry(entry)) {
			struct page *page = pfn_swap_entry_to_page(entry);

			if (unlikely(zap_skip_check_mapping(details, page)))
				continue;
			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
			rss[mm_counter(page)]--;

			if (is_device_private_entry(entry))
				page_remove_rmap(page, false);

			put_page(page);

However the devmap case will return NULL from vm_normal_page() and won't
do the put_page() embedded inside the __tlb_remove_page() in the
pte_present() block in the same function.

After reflecting for a while, I think Christoph's idea is quite
good. Just make it so you don't set pte_devmap() on your pages, and
then let's avoid pte_devmap for all refcount-correct ZONE_DEVICE pages.

Jason
Jason Gunthorpe Feb. 16, 2022, 2:03 a.m. UTC | #20
On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:

> Device private and device coherent pages are not marked with pte_devmap and they
> are backed by a struct page. The only way of inserting them is via migrate_vma.
> The refcount is decremented in zap_pte_range() on munmap() with special handling
> for device private pages. Looking at it again though I wonder if there is any
> special treatment required in zap_pte_range() for device coherent pages given
> they count as present pages.

This is what I guessed, but we shouldn't be able to just drop
pte_devmap on these pages without any other work?? Granted it does
very little already..

I thought at least gup_fast needed to be touched or did this get
handled by scanning the page list after the fact?

Jason
Alistair Popple Feb. 16, 2022, 2:36 a.m. UTC | #21
On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
> 
> > Device private and device coherent pages are not marked with pte_devmap and they
> > are backed by a struct page. The only way of inserting them is via migrate_vma.
> > The refcount is decremented in zap_pte_range() on munmap() with special handling
> > for device private pages. Looking at it again though I wonder if there is any
> > special treatment required in zap_pte_range() for device coherent pages given
> > they count as present pages.
> 
> This is what I guessed, but we shouldn't be able to just drop
> pte_devmap on these pages without any other work?? Granted it does
> very little already..

Yes, I agree we need to check this more closely. For device private pages
not having pte_devmap is fine, because they are non-present swap entries so
they always get special handling in the swap entry paths but the same isn't
true for coherent device pages.

> I thought at least gup_fast needed to be touched or did this get
> handled by scanning the page list after the fact?

Right, for gup I think the only special handling required is to prevent
pinning. I had assumed that check_and_migrate_movable_pages() would still get
called for gup_fast but unless I've missed something I don't think it does.
That means gup_fast could still pin movable and coherent pages. Technically
that is ok for coherent pages, but it's undesirable.

 - Alistair

> Jason
>
David Hildenbrand Feb. 16, 2022, 8:31 a.m. UTC | #22
On 16.02.22 03:36, Alistair Popple wrote:
> On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>>
>>> Device private and device coherent pages are not marked with pte_devmap and they
>>> are backed by a struct page. The only way of inserting them is via migrate_vma.
>>> The refcount is decremented in zap_pte_range() on munmap() with special handling
>>> for device private pages. Looking at it again though I wonder if there is any
>>> special treatment required in zap_pte_range() for device coherent pages given
>>> they count as present pages.
>>
>> This is what I guessed, but we shouldn't be able to just drop
>> pte_devmap on these pages without any other work?? Granted it does
>> very little already..
> 
> Yes, I agree we need to check this more closely. For device private pages
> not having pte_devmap is fine, because they are non-present swap entries so
> they always get special handling in the swap entry paths but the same isn't
> true for coherent device pages.

I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
look like when mapped? I'd assume it's also (currently) still offset by
one, meaning, if it's mapped into a single page table it's always at
least 2.

Just a note that if my assumption is correct and if we'd have such a
page mapped R/O, do_wp_page() would always have to copy it
unconditionally and would not be able to reuse it on write faults.
(while I'm working on improving the reuse logic, I think there is also
work in progress to avoid this additional reference on some ZONE_DEVICE
stuff -- I'd assume that would include DEVICE_COHERENT ?)

> 
>> I thought at least gup_fast needed to be touched or did this get
>> handled by scanning the page list after the fact?
> 
> Right, for gup I think the only special handling required is to prevent
> pinning. I had assumed that check_and_migrate_movable_pages() would still get
> called for gup_fast but unless I've missed something I don't think it does.
> That means gup_fast could still pin movable and coherent pages. Technically
> that is ok for coherent pages, but it's undesirable.

We really should have the same pinning rules for GUP vs. GUP-fast.
is_pinnable_page() should be the right place for such checks (similarly
as indicated in my reply to the migration series).
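
Something like the following sketch, assuming an is_device_coherent_page()
style helper testing the new pgmap type (and keeping in mind this check
feeds the FOLL_LONGTERM migration logic, so ordinary short-term FOLL_PIN
would remain allowed):

static inline bool is_pinnable_page(struct page *page)
{
	/* movable, CMA and device-coherent pages must not be
	 * long-term pinned */
	return !(is_zone_movable_page(page) || is_migrate_cma_page(page) ||
		 is_device_coherent_page(page));
}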
Jason Gunthorpe Feb. 16, 2022, 12:26 p.m. UTC | #23
On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
> On 16.02.22 03:36, Alistair Popple wrote:
> > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
> >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
> >>
> >>> Device private and device coherent pages are not marked with pte_devmap and they
> >>> are backed by a struct page. The only way of inserting them is via migrate_vma.
> >>> The refcount is decremented in zap_pte_range() on munmap() with special handling
> >>> for device private pages. Looking at it again though I wonder if there is any
> >>> special treatment required in zap_pte_range() for device coherent pages given
> >>> they count as present pages.
> >>
> >> This is what I guessed, but we shouldn't be able to just drop
> >> pte_devmap on these pages without any other work?? Granted it does
> >> very little already..
> > 
> > Yes, I agree we need to check this more closely. For device private pages
> > not having pte_devmap is fine, because they are non-present swap entries so
> > they always get special handling in the swap entry paths but the same isn't
> > true for coherent device pages.
> 
> I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
> look like when mapped? I'd assume it's also (currently) still offset by
> one, meaning, if it's mapped into a single page table it's always at
> least 2.

Christoph fixed this offset by one and updated the DEVICE_COHERENT
patchset, I hope we will see that version merged.

> >> I thought at least gup_fast needed to be touched or did this get
> >> handled by scanning the page list after the fact?
> > 
> > Right, for gup I think the only special handling required is to prevent
> > pinning. I had assumed that check_and_migrate_movable_pages() would still get
> > called for gup_fast but unless I've missed something I don't think it does.
> > That means gup_fast could still pin movable and coherent pages. Technically
> > that is ok for coherent pages, but it's undesirable.
> 
> We really should have the same pinning rules for GUP vs. GUP-fast.
> is_pinnable_page() should be the right place for such checks (similarly
> as indicated in my reply to the migration series).

Yes, I think this is a bug too.

The other place that needs careful audit is all the callers using
vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
if we don't set pte_devmap.

Jason
Felix Kuehling Feb. 16, 2022, 4:56 p.m. UTC | #24
Am 2022-02-15 um 21:01 schrieb Jason Gunthorpe:
> On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:
>
>>> Userspace does
>>>    1) mmap(MAP_PRIVATE) to allocate anon memory
>>>    2) something to trigger migration to install a ZONE_DEVICE page
>>>    3) munmap()
>>>
>>> Who decrements the refcount on the munmap?
>>>
>>> When a ZONE_DEVICE page is installed in the PTE, it is supposed to be
>>> marked as pte_devmap and that disables all the normal page refcounting
>>> during munmap().
>>>
>>> fsdax makes this work by working the refcounts backwards, the page is
>>> refcounted while it exists in the driver, when the driver decides to
>>> remove it then unmap_mapping_range() is called to purge it from all
>>> PTEs and then the refcount is decremented. munmap/fork/etc don't change the
>>> refcount.
>> Hmm, that just means, whether or not there are PTEs doesn't really
>> matter.
> Yes, that is the FSDAX model
>
>> It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure
>> where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I
>> can't find it in our driver, or in the test_hmm driver for that matter.
> It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap
> entries. The put_page for that case is here:
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 				struct vm_area_struct *vma, pmd_t *pmd,
> 				unsigned long addr, unsigned long end,
> 				struct zap_details *details)
> {
> [..]
> 		if (is_device_private_entry(entry) ||
> 		    is_device_exclusive_entry(entry)) {
> 			struct page *page = pfn_swap_entry_to_page(entry);
>
> 			if (unlikely(zap_skip_check_mapping(details, page)))
> 				continue;
> 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> 			rss[mm_counter(page)]--;
>
> 			if (is_device_private_entry(entry))
> 				page_remove_rmap(page, false);
>
> 			put_page(page);
>
> However the devmap case will return NULL from vm_normal_page() and won't
> do the put_page() embedded inside the __tlb_remove_page() in the
> pte_present() block in the same function.
>
> After reflecting for a while, I think Christoph's idea is quite
> good. Just make it so you don't set pte_devmap() on your pages, and
> then let's avoid pte_devmap for all refcount-correct ZONE_DEVICE pages.

I'm not sure if pte_devmap is actually set for our DEVICE_COHERENT 
pages. As far as I can tell, this comes from a bit in the pfn:

    #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
    #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
    ...
    static inline bool pfn_t_devmap(pfn_t pfn)
    {
             const u64 flags = PFN_DEV|PFN_MAP;

             return (pfn.val & flags) == flags;
    }

In the case of DEVICE_COHERENT memory, the pfns correspond to real 
physical memory addresses. I don't think they have those PFN_DEV|PFN_MAP 
bits set.

Regards,
   Felix


>
> Jason
Jason Gunthorpe Feb. 16, 2022, 5:28 p.m. UTC | #25
On Wed, Feb 16, 2022 at 11:56:51AM -0500, Felix Kuehling wrote:

> In the case of DEVICE_COHERENT memory, the pfns correspond to real physical
> memory addresses. I don't think they have those PFN_DEV|PFN_MAP bits set.

So do DAX pages. The PTE flag does several things. As this would be
the first time ZONE_DEVICE pages do not set devmap it needs a full
audit.

eg the gup_fast bug Alistair pointed at needs fixing at least.

Jason
Alistair Popple Feb. 17, 2022, 1:05 a.m. UTC | #26
Jason Gunthorpe <jgg@nvidia.com> writes:

> On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
>> On 16.02.22 03:36, Alistair Popple wrote:
>> > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>> >>
>> >>> Device private and device coherent pages are not marked with pte_devmap and they
>> >>> are backed by a struct page. The only way of inserting them is via migrate_vma.
>> >>> The refcount is decremented in zap_pte_range() on munmap() with special handling
>> >>> for device private pages. Looking at it again though I wonder if there is any
>> >>> special treatment required in zap_pte_range() for device coherent pages given
>> >>> they count as present pages.
>> >>
>> >> This is what I guessed, but we shouldn't be able to just drop
>> >> pte_devmap on these pages without any other work?? Granted it does
>> >> very little already..
>> >
>> > Yes, I agree we need to check this more closely. For device private pages
>> > not having pte_devmap is fine, because they are non-present swap entries so
>> > they always get special handling in the swap entry paths but the same isn't
>> > true for coherent device pages.
>>
>> I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
>> look like when mapped? I'd assume it's also (currently) still offset by
>> one, meaning, if it's mapped into a single page table it's always at
>> least 2.
>
> Christoph fixed this offset by one and updated the DEVICE_COHERENT
> patchset, I hope we will see that version merged.
>
>> >> I thought at least gup_fast needed to be touched or did this get
>> >> handled by scanning the page list after the fact?
>> >
>> > Right, for gup I think the only special handling required is to prevent
>> > pinning. I had assumed that check_and_migrate_movable_pages() would still get
>> > called for gup_fast but unless I've missed something I don't think it does.
>> > That means gup_fast could still pin movable and coherent pages. Technically
>> > that is ok for coherent pages, but it's undesirable.
>>
>> We really should have the same pinning rules for GUP vs. GUP-fast.
>> is_pinnable_page() should be the right place for such checks (similarly
>> as indicated in my reply to the migration series).
>
> Yes, I think this is a bug too.

Agreed, I will add a fix for it to my series as I was surprised the rules for
PUP-fast were different. I can see how this happened though -
check_and_migrate_cma_pages() (the precursor to
check_and_migrate_movable_pages()) was added before PUP-fast and FOLL_LONGTERM
so I guess we just never added this check there.

- Alistair

> The other place that needs careful audit is all the callers using
> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
> if we don't set pte_devmap.
>
> Jason
Felix Kuehling Feb. 17, 2022, 9:12 p.m. UTC | #27
Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
> The other place that needs careful audit is all the callers using
> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
> if we don't set pte_devmap.

How much code are we talking about here? A quick search finds 26 
call-sites in 12 files in current master:

    fs/proc/task_mmu.c
    mm/hmm.c
    mm/gup.c
    mm/huge_memory.c (vm_normal_page_pmd)
    mm/khugepaged.c
    mm/madvise.c
    mm/mempolicy.c
    mm/memory.c
    mm/mlock.c
    mm/migrate.c
    mm/mprotect.c
    mm/memcontrol.c

I'm thinking of a more theoretical approach: Instead of auditing all 
users, I'd ask, what are the invariants that a vm_normal_page should 
have. Then check, whether our DEVICE_COHERENT pages satisfy them. But 
maybe the concept of a vm_normal_page isn't defined clearly enough for that.

That said, I think we (Alex and myself) made an implicit assumption from 
the start, that a DEVICE_COHERENT page should behave a lot like a normal 
page in terms of VMA mappings, even if we didn't know what that means in 
detail.

I can now at least name some differences between DEVICE_COHERENT and 
normal pages: how the memory is allocated, how data is migrated into 
DEVICE_COHERENT pages and that it can't be on any LRU list (because the 
lru list_head in struct page is aliased by pgmap and zone_device_data). 
Maybe I'll find more differences if I keep digging.

Regards,
   Felix


>
> Jason
Jason Gunthorpe Feb. 18, 2022, 12:19 a.m. UTC | #28
On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:

> I'm thinking of a more theoretical approach: Instead of auditing all users,
> I'd ask, what are the invariants that a vm_normal_page should have. Then
> check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept
> of a vm_normal_page isn't defined clearly enough for that.

I would say the expectation is that only 'page cache and anon' pages
are returned - ie the first union in struct page

This is because the first file in your list I looked at:

static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct mm_walk *walk)

{
		page = vm_normal_page(vma, addr, ptent);
[..]
		if (pageout) {
			if (!isolate_lru_page(page)) {

Uses the LRU field, so this is incompatible with all the other page
types.

One mitigation of this might be to formally make vm_normal_page() ==
'pte to page cache and anon page' and add a new function that is 'pte
to any struct page'

Then go through and sort callers as appropriate.

The 'pte to page cache and anon page' can detect ZONE_DEVICE by
calling is_zone_device_page() instead of pte_devmap() and then continue
to return NULL. This same trick will fix GUP_fast.

Jason
Alistair Popple Feb. 18, 2022, 12:59 a.m. UTC | #29
Felix Kuehling <felix.kuehling@amd.com> writes:

> Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
>> The other place that needs careful audit is all the callers using
>> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
>> if we don't set pte_devmap.
>
> How much code are we talking about here? A quick search finds 26 call-sites in
> 12 files in current master:
>
>    fs/proc/task_mmu.c
>    mm/hmm.c
>    mm/gup.c
>    mm/huge_memory.c (vm_normal_page_pmd)
>    mm/khugepaged.c
>    mm/madvise.c
>    mm/mempolicy.c
>    mm/memory.c
>    mm/mlock.c
>    mm/migrate.c
>    mm/mprotect.c
>    mm/memcontrol.c
>
> I'm thinking of a more theoretical approach: instead of auditing all users, I'd
> ask what invariants a vm_normal_page should have, and then check whether
> our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
> vm_normal_page isn't defined clearly enough for that.
>
> That said, I think we (Alex and I) made an implicit assumption from the
> start that a DEVICE_COHERENT page should behave a lot like a normal page in
> terms of VMA mappings, even if we didn't know what that means in detail.

Yes, I'm afraid I made a similar mistake when reviewing this, forgetting that
DEVICE_COHERENT pages are not LRU pages and therefore need special treatment in
some places. So for now I will have to withdraw my Reviewed-by until this has
been looked at more closely, because, as you note below, accidentally treating
them as LRU pages leads to a bad time.

> I can now at least name some differences between DEVICE_COHERENT and normal
> pages: how the memory is allocated, how data is migrated into DEVICE_COHERENT
> pages, and that they can't be on any LRU list (because the lru list_head in
> struct page is aliased by pgmap and zone_device_data). Maybe I'll find more
> differences if I keep digging.
>
> Regards,
>   Felix
>
>
>>
>> Jason
Felix Kuehling Feb. 18, 2022, 7:20 p.m. UTC | #30
On 2022-02-17 at 19:19, Jason Gunthorpe wrote:
> On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
>
>> I'm thinking of a more theoretical approach: instead of auditing all users,
>> I'd ask what invariants a vm_normal_page should have, and then check
>> whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
>> vm_normal_page isn't defined clearly enough for that.
> I would say the expectation is that only 'page cache and anon' pages
> are returned - ie the first union in struct page
>
> This is because the first file in your list I looked at:
>
> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> 				unsigned long addr, unsigned long end,
> 				struct mm_walk *walk)
>
> {
> 		page = vm_normal_page(vma, addr, ptent);
> [..]
> 		if (pageout) {
> 			if (!isolate_lru_page(page)) {
>
> Uses the LRU field, so this is incompatible with all the other page
> types.
>
> One mitigation of this might be to formally make vm_normal_page() ==
> 'pte to page cache and anon page' and add a new function that is 'pte
> to any struct page'
>
> Then go through and sort callers as appropriate.
>
> The 'pte to page cache and anon page' can detect ZONE_DEVICE by
> calling is_zone_device_page() instead of pte_devmap() and then continue
> to return NULL. This same trick will fix GUP_fast.

Sounds good to me. What about vm_normal_page_pmd? Should we remove the 
pmd_devmap check from that function as well? I'm not even sure what a 
huge zone_device page would look like, but maybe that's a worthwhile 
future optimization for our driver.

I'd propose the function names vm_normal_page and 
vm_normal_or_device_page for the two functions you described. The latter 
would basically be the current vm_normal_page with the pte_devmap check 
removed. vm_normal_page could be implemented as a wrapper around 
vm_normal_or_device_page, which just adds the !is_zone_device_page() check.
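
As a sketch, with the naming above (still under discussion) and with 
the pte_special()/PFN-map handling of the current implementation 
reduced to a placeholder:

struct page *vm_normal_or_device_page(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t pte)
{
        /* today's vm_normal_page() minus the pte_devmap() early-out:
         * resolves the pte to any struct page, ZONE_DEVICE included */
        if (pte_special(pte))
                return NULL;    /* simplified; real code has more cases */
        return pfn_to_page(pte_pfn(pte));
}

struct page *vm_normal_page(struct vm_area_struct *vma,
                            unsigned long addr, pte_t pte)
{
        struct page *page = vm_normal_or_device_page(vma, addr, pte);

        /* only 'page cache and anon' pages are normal */
        if (page && is_zone_device_page(page))
                return NULL;
        return page;
}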

Regards,
   Felix


>
> Jason
>
Jason Gunthorpe Feb. 18, 2022, 7:26 p.m. UTC | #31
On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
> On 2022-02-17 at 19:19, Jason Gunthorpe wrote:
> > On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
> > 
> > > I'm thinking of a more theoretical approach: instead of auditing all users,
> > > I'd ask what invariants a vm_normal_page should have, and then check
> > > whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
> > > vm_normal_page isn't defined clearly enough for that.
> > I would say the expectation is that only 'page cache and anon' pages
> > are returned - ie the first union in struct page
> > 
> > This is because the first file in your list I looked at:
> > 
> > static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > 				unsigned long addr, unsigned long end,
> > 				struct mm_walk *walk)
> > 
> > {
> > 		page = vm_normal_page(vma, addr, ptent);
> > [..]
> > 		if (pageout) {
> > 			if (!isolate_lru_page(page)) {
> > 
> > Uses the LRU field, so this is incompatible with all the other page
> > types.
> > 
> > One mitigation of this might be to formally make vm_normal_page() ==
> > 'pte to page cache and anon page' and add a new function that is 'pte
> > to any struct page'
> > 
> > Then go through and sort callers as appropriate.
> > 
> > The 'pte to page cache and anon page' can detect ZONE_DEVICE by
> > calling is_zone_device_page() instead of pte_devmap() and then continue
> > to return NULL. This same trick will fix GUP_fast.
> 
> Sounds good to me. What about vm_normal_page_pmd? Should we remove the
> pmd_devmap check from that function as well? I'm not even sure what a huge
> zone_device page would look like, but maybe that's a worthwhile future
> optimization for our driver.

IIRC there are other problems here, as PMDs are currently wired to THPs
and not general at all.

We have huge zone_device pages; one is just any compound page of
contiguous pfns - you should be aggregating any contiguous string of
logical PFNs together into a folio for performance. Whether the folio is
stuffed into a PMD or not is a different question.
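
To make the aggregation concrete, a hypothetical helper (name and
surrounding context made up here) that measures the physically
contiguous run at the head of a pfn array, i.e. how many entries could
be coalesced into one folio:

static unsigned int count_contiguous_pfns(const unsigned long *pfns,
                                          unsigned int nr)
{
        unsigned int i = 1;

        /* pfns[0..i) form one physically contiguous run */
        while (i < nr && pfns[i] == pfns[0] + i)
                i++;
        return i;
}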

> I'd propose the function names vm_normal_page and vm_normal_or_device_page
> for the two functions you described. 

I wouldn't say device_page; it should be any type of page - though
device_page is the only other option ATM, AFAIK.

> current vm_normal_page with the pte_devmap check removed. vm_normal_page
> could be implemented as a wrapper around vm_normal_or_device_page, which
> just adds the !is_zone_device_page() check.

Yes

Jason
Felix Kuehling Feb. 18, 2022, 7:37 p.m. UTC | #32
On 2022-02-18 at 14:26, Jason Gunthorpe wrote:
> On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
>> On 2022-02-17 at 19:19, Jason Gunthorpe wrote:
>>> On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
>>>
>>>> I'm thinking of a more theoretical approach: instead of auditing all users,
>>>> I'd ask what invariants a vm_normal_page should have, and then check
>>>> whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
>>>> vm_normal_page isn't defined clearly enough for that.
>>> I would say the expectation is that only 'page cache and anon' pages
>>> are returned - ie the first union in struct page
>>>
>>> This is because the first file in your list I looked at:
>>>
>>> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>> 				unsigned long addr, unsigned long end,
>>> 				struct mm_walk *walk)
>>>
>>> {
>>> 		page = vm_normal_page(vma, addr, ptent);
>>> [..]
>>> 		if (pageout) {
>>> 			if (!isolate_lru_page(page)) {
>>>
>>> Uses the LRU field, so this is incompatible with all the other page
>>> types.
>>>
>>> One mitigation of this might be to formally make vm_normal_page() ==
>>> 'pte to page cache and anon page' and add a new function that is 'pte
>>> to any struct page'
>>>
>>> Then go through and sort callers as appropriate.
>>>
>>> The 'pte to page cache and anon page' can detect ZONE_DEVICE by
>>> calling is_zone_device_page() instead of pte_devmap() and then continue
>>> to return NULL. This same trick will fix GUP_fast.
>> Sounds good to me. What about vm_normal_page_pmd? Should we remove the
>> pmd_devmap check from that function as well? I'm not even sure what a huge
>> zone_device page would look like, but maybe that's a worthwhile future
>> optimization for our driver.
> IIRC there are other problems here, as PMDs are currently wired to THPs
> and not general at all.
>
> We have huge zone_device pages; one is just any compound page of
> contiguous pfns - you should be aggregating any contiguous string of
> logical PFNs together into a folio for performance. Whether the folio is
> stuffed into a PMD or not is a different question.
>
>> I'd propose the function names vm_normal_page and vm_normal_or_device_page
>> for the two functions you described.
> I wouldn't say device_page; it should be any type of page - though
> device_page is the only other option ATM, AFAIK.

Ok, then how about vm_normal_lru_page and vm_normal_any_page?

Regards,
   Felix


>
>> current vm_normal_page with the pte_devmap check removed. vm_normal_page
>> could be implemented as a wrapper around vm_normal_or_device_page, which
>> just adds the !is_zone_device_page() check.
> Yes
>
> Jason
Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..727b8c789193 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -39,6 +39,13 @@  struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -59,6 +66,7 @@  struct vmem_altmap {
 enum memory_type {
 	/* 0 is reserved to catch uninitialized type fields */
 	MEMORY_DEVICE_PRIVATE = 1,
+	MEMORY_DEVICE_COHERENT,
 	MEMORY_DEVICE_FS_DAX,
 	MEMORY_DEVICE_GENERIC,
 	MEMORY_DEVICE_PCI_P2PDMA,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..545f41da9fe9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1106,6 +1106,7 @@  static inline bool page_is_devmap_managed(struct page *page)
 		return false;
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 	case MEMORY_DEVICE_FS_DAX:
 		return true;
 	default:
@@ -1135,6 +1136,21 @@  static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_coherent_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
+static inline bool is_dev_private_or_coherent_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+		page->pgmap->type == MEMORY_DEVICE_COHERENT);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..0882b5b2a857 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5691,8 +5691,8 @@  static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
- *     (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
+ *   thus not on the lru.
  *     For now we such page is charge like a regular page would be as for all
  *     intent and purposes it is just special memory taking the place of a
  *     regular page.
@@ -5726,7 +5726,7 @@  static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page_memcg(page) == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (is_device_private_page(page))
+			if (is_dev_private_or_coherent_page(page))
 				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a9..f498ed3ece79 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1617,12 +1617,16 @@  static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto unlock;
 	}
 
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
 		/*
-		 * TODO: Handle HMM pages which may need coordination
+		 * TODO: Handle device pages which may need coordination
 		 * with device-side memory.
 		 */
 		goto unlock;
+	default:
+		break;
 	}
 
 	/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..354c8027823f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,6 +44,7 @@  EXPORT_SYMBOL(devmap_managed_key);
 static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_COHERENT ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_dec(&devmap_managed_key);
 }
@@ -51,6 +52,7 @@  static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_COHERENT ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_inc(&devmap_managed_key);
 }
@@ -327,6 +329,16 @@  void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			return ERR_PTR(-EINVAL);
 		}
 		break;
+	case MEMORY_DEVICE_COHERENT:
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
+		if (!pgmap->owner) {
+			WARN(1, "Missing owner\n");
+			return ERR_PTR(-EINVAL);
+		}
+		break;
 	case MEMORY_DEVICE_FS_DAX:
 		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
 		    IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
@@ -469,7 +481,7 @@  EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_devmap_managed_page(struct page *page)
 {
 	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
+	if (!is_dev_private_or_coherent_page(page)) {
 		wake_up_var(&page->_refcount);
 		return;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..cd137aedcfe5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -345,7 +345,7 @@  static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 * Device private pages have an extra refcount as they are
 	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_device_private_page(page);
+	expected_count += is_dev_private_or_coherent_page(page);
 	if (mapping)
 		expected_count += compound_nr(page) + page_has_private(page);
 
@@ -2611,7 +2611,7 @@  EXPORT_SYMBOL(migrate_vma_setup);
  *     handle_pte_fault()
  *       do_anonymous_page()
  * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or coherent page.
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
@@ -2676,25 +2676,24 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 */
 	__SetPageUptodate(page);
 
-	if (is_zone_device_page(page)) {
-		if (is_device_private_page(page)) {
-			swp_entry_t swp_entry;
+	if (is_device_private_page(page)) {
+		swp_entry_t swp_entry;
 
-			if (vma->vm_flags & VM_WRITE)
-				swp_entry = make_writable_device_private_entry(
-							page_to_pfn(page));
-			else
-				swp_entry = make_readable_device_private_entry(
-							page_to_pfn(page));
-			entry = swp_entry_to_pte(swp_entry);
-		} else {
-			/*
-			 * For now we only support migrating to un-addressable
-			 * device memory.
-			 */
-			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
-			goto abort;
-		}
+		if (vma->vm_flags & VM_WRITE)
+			swp_entry = make_writable_device_private_entry(
+						page_to_pfn(page));
+		else
+			swp_entry = make_readable_device_private_entry(
+						page_to_pfn(page));
+		entry = swp_entry_to_pte(swp_entry);
+	} else if (is_zone_device_page(page) &&
+		   !is_device_coherent_page(page)) {
+		/*
+		 * We support migrating to private and coherent types
+		 * for device zone memory.
+		 */
+		pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
+		goto abort;
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2796,10 +2795,10 @@  void migrate_vma_pages(struct migrate_vma *migrate)
 		mapping = page_mapping(page);
 
 		if (is_zone_device_page(newpage)) {
-			if (is_device_private_page(newpage)) {
+			if (is_dev_private_or_coherent_page(newpage)) {
 				/*
-				 * For now only support private anonymous when
-				 * migrating to un-addressable device memory.
+				 * For now only support private and coherent
+				 * anonymous when migrating to device memory.
 				 */
 				if (mapping) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..f2bd5a3da72c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1835,7 +1835,7 @@  static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
 			unsigned long pfn = page_to_pfn(page);
 			swp_entry_t entry;
 			pte_t swp_pte;
@@ -1976,7 +1976,8 @@  void try_to_migrate(struct page *page, enum ttu_flags flags)
 					TTU_SYNC)))
 		return;
 
-	if (is_zone_device_page(page) && !is_device_private_page(page))
+	if (is_zone_device_page(page) &&
+	    !is_dev_private_or_coherent_page(page))
 		return;
 
 	/*