[v1,13/15] mm: handling Non-LRU pages returned by vm_normal_pages

Message ID 20220505213438.25064-14-alex.sierra@amd.com (mailing list archive)
State Superseded
Series Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

Commit Message

Sierra Guiza, Alejandro (Alex) May 5, 2022, 9:34 p.m. UTC
With DEVICE_COHERENT, vm_normal_page() will soon return device-managed
anonymous pages that are not LRU pages. Although they behave like
normal pages for purposes of mapping in CPU page tables and for COW,
they do not support LRU lists, NUMA migration or THP. Callers of
vm_normal_page() that depend on those features now skip such pages
with an is_zone_device_page() check.

Also introduce a FOLL_LRU flag that adds the same behaviour to
follow_page() and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 fs/proc/task_mmu.c | 2 +-
 include/linux/mm.h | 3 ++-
 mm/gup.c           | 2 ++
 mm/huge_memory.c   | 2 +-
 mm/khugepaged.c    | 9 ++++++---
 mm/ksm.c           | 6 +++---
 mm/madvise.c       | 4 ++--
 mm/memory.c        | 9 ++++++++-
 mm/mempolicy.c     | 2 +-
 mm/migrate.c       | 4 ++--
 mm/mlock.c         | 2 +-
 mm/mprotect.c      | 2 +-
 12 files changed, 30 insertions(+), 17 deletions(-)
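
Editor's sketch: the two filtering patterns this patch applies can be
modelled in userspace C roughly as below. This is only an illustration
under stated assumptions, not kernel code: struct page here is a
hypothetical stand-in with a single flag, filter_foll_lru() and
caller_should_skip() are names invented for this sketch, and FOLL_GET's
value is included only so the flags can be combined. Only the FOLL_LRU
value and the shape of the two checks are taken from the diff.

```c
/* Userspace model only; none of these types are the kernel's definitions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FOLL_GET 0x04u   /* illustrative value, used only to combine flags */
#define FOLL_LRU 0x1000u /* value added by this patch */

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	bool zone_device; /* models what is_zone_device_page() reports */
};

bool is_zone_device_page(const struct page *page)
{
	return page->zone_device;
}

/*
 * Models the check added to follow_page_pte(): when the caller passes
 * FOLL_LRU, a ZONE_DEVICE page is treated as if no page was found.
 */
struct page *filter_foll_lru(struct page *page, unsigned int flags)
{
	if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
		return NULL;
	return page;
}

/*
 * Models the check added to direct vm_normal_page() callers such as
 * mlock_pte_range() or do_numa_page(): skip NULL and ZONE_DEVICE pages.
 */
bool caller_should_skip(const struct page *page)
{
	return !page || is_zone_device_page(page);
}
```

In this model a device page passes through unchanged unless the caller
opts in with FOLL_LRU, which matches how the patch only changes
behaviour for the call sites it touches.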

Comments

Jason Gunthorpe May 11, 2022, 6:50 p.m. UTC | #1
On Thu, May 05, 2022 at 04:34:36PM -0500, Alex Sierra wrote:

> diff --git a/mm/memory.c b/mm/memory.c
> index 76e3af9639d9..892c4cc54dc2 100644
> +++ b/mm/memory.c
> @@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  		if (is_zero_pfn(pfn))
>  			return NULL;
>  		if (pte_devmap(pte))
> +/*
> + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
> + * is never incremented for device pages that are mmap through DAX mechanism
> + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
> + * zap_pte_range is called and vm_normal_page return a valid page with
> + * page_mapcount() = 0, before page_remove_rmap is called.
> + */
>  			return NULL;

? Where does this series cause device coherent to be returned?

Wasn't the plan to not set pte_devmap() ?

Jason
Sierra Guiza, Alejandro (Alex) May 12, 2022, 10:33 p.m. UTC | #2
On 5/11/2022 1:50 PM, Jason Gunthorpe wrote:
> On Thu, May 05, 2022 at 04:34:36PM -0500, Alex Sierra wrote:
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 76e3af9639d9..892c4cc54dc2 100644
>> +++ b/mm/memory.c
>> @@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>   		if (is_zero_pfn(pfn))
>>   			return NULL;
>>   		if (pte_devmap(pte))
>> +/*
>> + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
>> + * is never incremented for device pages that are mmap through DAX mechanism
>> + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
>> + * zap_pte_range is called and vm_normal_page return a valid page with
>> + * page_mapcount() = 0, before page_remove_rmap is called.
>> + */
>>   			return NULL;
> ? Where does this series cause device coherent to be returned?
In our case, device coherent pages can be obtained as a result of
migration (patches 6/7 of 15), ending up mapped in CPU page tables.
Later on, these pages might need to be returned by get_user_pages or
other callers through vm_normal_page. Our approach in this series is to
handle device-coherent pages returned by vm_normal_page inside each
caller. For example, device coherent pages don’t support LRU lists,
NUMA migration or THP.
>
> Wasn't the plan to not set pte_devmap() ?

amdgpu does not set pte_devmap for our DEVICE_COHERENT pages. DEVMAP 
flags are set by drivers like virtio_fs or pmem, where 
MEMORY_DEVICE_FS_DAX type is used.
This patch series deals with DEVICE_COHERENT pages. My understanding
was that the DAX code and DEVICE_GENERIC would be fixed up later by
someone more familiar with them. Were you expecting that we'd fix the
DAX usage of pte_devmap flags in this patch series as well?

Regards,
Alex Sierra

>
> Jason
Jason Gunthorpe May 13, 2022, 11:45 a.m. UTC | #3
On Thu, May 12, 2022 at 05:33:44PM -0500, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 5/11/2022 1:50 PM, Jason Gunthorpe wrote:
> > On Thu, May 05, 2022 at 04:34:36PM -0500, Alex Sierra wrote:
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 76e3af9639d9..892c4cc54dc2 100644
> > > +++ b/mm/memory.c
> > > @@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >   		if (is_zero_pfn(pfn))
> > >   			return NULL;
> > >   		if (pte_devmap(pte))
> > > +/*
> > > + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
> > > + * is never incremented for device pages that are mmap through DAX mechanism
> > > + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
> > > + * zap_pte_range is called and vm_normal_page return a valid page with
> > > + * page_mapcount() = 0, before page_remove_rmap is called.
> > > + */
> > >   			return NULL;
> > ? Where does this series cause device coherent to be returned?
> In our case, device coherent pages could be obtained as a result of
> migration(Patches 6/7 of 15), ending up mapped in CPU page tables. Later on,
> these pages might need to be returned by get_user_pages or other callers
> through vm_normal_pages. Our approach in this series, is to handle
> device-coherent-managed pages returned by vm_normal_pages, inside each
> caller. EX. device coherent pages don’t support LRU lists, NUMA migration or
> THP.
> > 
> > Wasn't the plan to not set pte_devmap() ?
> 
> amdgpu does not set pte_devmap for our DEVICE_COHERENT pages. DEVMAP flags
> are set by drivers like virtio_fs or pmem, where MEMORY_DEVICE_FS_DAX type
> is used.
> This patch series deals with DEVICE_COHERENT pages. My understanding was,
> that the DAX code and DEVICE_GENERIC would be fixed up later by someone more
> familiar with it. Were you expecting that we'd fix the DAX usage of
> pte_devmap flags in this patch series as well?

No, I was just trying to find where the pages got inserted and
understand the comment above. I think the comment should be clarified
more like you explained:

  New users of ZONE_DEVICE will not set pte_devmap() and will have
  refcounts incremented on their struct pages when they are inserted
  into PTEs, thus they are safe to return here. Legacy ZONE_DEVICE
  pages that set pte_devmap() do not have refcounts. ....

Jason
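
Editor's sketch: the distinction discussed in this thread can be
modelled in userspace C as below. It is a simplified illustration, not
the kernel's vm_normal_page(): struct model_pte, struct model_page and
model_vm_normal_page() are hypothetical names, the VM_PFNMAP/VM_MIXEDMAP
and find_special_page() handling is left out, and it assumes that
DEVICE_COHERENT PTEs set neither the special nor the devmap bit (the
thread only states that amdgpu does not set pte_devmap).

```c
/* Userspace model only; a simplified stand-in for vm_normal_page(). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	int id;
};

/* Hypothetical stand-in for the PTE properties the function tests. */
struct model_pte {
	bool special;            /* models pte_special() */
	bool devmap;             /* models pte_devmap() */
	bool zero_pfn;           /* models is_zero_pfn(pte_pfn()) */
	struct model_page *page; /* page the PFN would resolve to */
};

struct model_page *model_vm_normal_page(const struct model_pte *pte)
{
	/* Non-special PTEs take the normal path and yield their page. */
	if (!pte->special)
		return pte->page;

	/* The shared zero page is never returned. */
	if (pte->zero_pfn)
		return NULL;

	/* Legacy ZONE_DEVICE users that set pte_devmap() (e.g. FS DAX). */
	if (pte->devmap)
		return NULL;

	/* The kernel would report a bad PTE here; the model returns NULL. */
	return NULL;
}
```

Under that assumption a device coherent PTE never reaches the
pte_devmap() branch, which is why callers have to filter the returned
page themselves.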
Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..5d620733f173 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1785,7 +1785,7 @@  static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 		return NULL;
 
 	page = vm_normal_page(vma, addr, pte);
-	if (!page)
+	if (!page || is_zone_device_page(page))
 		return NULL;
 
 	if (PageReserved(page))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f44254af8ce..d7f253a0c41e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -601,7 +601,7 @@  struct vm_operations_struct {
 #endif
 	/*
 	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
+	 * page for @addr. This is useful if the default behavior
 	 * (using pte_page()) would not find the correct page.
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -2929,6 +2929,7 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
+#define FOLL_LRU        0x1000  /* return only LRU (anon or page cache) */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 #define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index a214c8df7140..13bf04bce132 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -479,6 +479,8 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
+	if ((flags & FOLL_LRU) && page && is_zone_device_page(page))
+		page = NULL;
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..dc572bc9fa1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2851,7 +2851,7 @@  static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		}
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);
 
 		if (IS_ERR(page))
 			continue;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..eb3cfd679800 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		page = vm_normal_page(vma, address, pteval);
-		if (unlikely(!page)) {
+		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
 			result = SCAN_PAGE_NULL;
 			goto out;
 		}
@@ -1276,7 +1276,7 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 			writable = true;
 
 		page = vm_normal_page(vma, _address, pteval);
-		if (unlikely(!page)) {
+		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
 			result = SCAN_PAGE_NULL;
 			goto out_unmap;
 		}
@@ -1484,7 +1484,8 @@  void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 			goto abort;
 
 		page = vm_normal_page(vma, addr, *pte);
-
+		if (page && is_zone_device_page(page))
+			page = NULL;
 		/*
 		 * Note that uprobe, debugger, or MAP_PRIVATE may change the
 		 * page table, but the new page will not be a subpage of hpage.
@@ -1502,6 +1503,8 @@  void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
+		if (page && is_zone_device_page(page))
+			goto abort;
 		page_remove_rmap(page, vma, false);
 	}
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 063a48eeb5ee..f16056efca21 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@  static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	do {
 		cond_resched();
 		page = follow_page(vma, addr,
-				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
@@ -559,7 +559,7 @@  static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	if (!vma)
 		goto out;
 
-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
 	if (PageAnon(page)) {
@@ -2288,7 +2288,7 @@  static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 		while (ksm_scan.address < vma->vm_end) {
 			if (ksm_test_exit(mm))
 				break;
-			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
+			*page = follow_page(vma, ksm_scan.address, FOLL_GET | FOLL_LRU);
 			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..e9c24c834e98 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -413,7 +413,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 
 		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
+		if (!page || is_zone_device_page(page))
 			continue;
 
 		/*
@@ -628,7 +628,7 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		}
 
 		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
+		if (!page || is_zone_device_page(page))
 			continue;
 
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..892c4cc54dc2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -621,6 +621,13 @@  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (is_zero_pfn(pfn))
 			return NULL;
 		if (pte_devmap(pte))
+/*
+ * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
+ * is never incremented for device pages that are mmap through DAX mechanism
+ * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
+ * zap_pte_range is called and vm_normal_page return a valid page with
+ * page_mapcount() = 0, before page_remove_rmap is called.
+ */
 			return NULL;
 
 		print_bad_pte(vma, addr, pte, NULL);
@@ -4422,7 +4429,7 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page)
+	if (!page || is_zone_device_page(page))
 		goto out_map;
 
 	/* TODO: handle PTE-mapped THP */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8c74107a2b15..e32edbecb0cd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -528,7 +528,7 @@  static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!pte_present(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
+		if (!page || is_zone_device_page(page))
 			continue;
 		/*
 		 * vm_normal_page() filters out zero pages, but there might
diff --git a/mm/migrate.c b/mm/migrate.c
index e18ddee56f37..ed7ea77de539 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1612,7 +1612,7 @@  static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	/* FOLL_DUMP to ignore special (like zero) pages */
-	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);
 
 	err = PTR_ERR(page);
 	if (IS_ERR(page))
@@ -1803,7 +1803,7 @@  static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 			goto set_status;
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_DUMP);
+		page = follow_page(vma, addr, FOLL_DUMP | FOLL_LRU);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
diff --git a/mm/mlock.c b/mm/mlock.c
index 716caf851043..b14e929084cc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -333,7 +333,7 @@  static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!pte_present(*pte))
 			continue;
 		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
+		if (!page || is_zone_device_page(page))
 			continue;
 		if (PageTransCompound(page))
 			continue;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..a6f3587ea29a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -91,7 +91,7 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					continue;
 
 				page = vm_normal_page(vma, addr, oldpte);
-				if (!page || PageKsm(page))
+				if (!page || is_zone_device_page(page) || PageKsm(page))
 					continue;
 
 				/* Also skip shared copy-on-write pages */