diff mbox series

[v8,02/14] mm/gup: check every subpage of a compound page during isolation

Message ID 20210125194751.1275316-3-pasha.tatashin@soleen.com (mailing list archive)
State New, archived
Headers show
Series prohibit pinning pages in ZONE_MOVABLE | expand

Commit Message

Pasha Tatashin Jan. 25, 2021, 7:47 p.m. UTC
When pages are isolated in check_and_migrate_movable_pages() we skip
compound number of pages at a time. However, as Jason noted, it is
not necessary correct that pages[i] corresponds to the pages that
we skipped. This is because it is possible that the addresses in
this range had split_huge_pmd()/split_huge_pud(), and these functions
do not update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end-up having pages which are not compound
size page aligned.

Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

Comments

Joao Martins Feb. 3, 2021, 1:22 p.m. UTC | #1
On 1/25/21 7:47 PM, Pavel Tatashin wrote:
> When pages are isolated in check_and_migrate_movable_pages() we skip
> compound number of pages at a time. However, as Jason noted, it is
> not necessary correct that pages[i] corresponds to the pages that
> we skipped. This is because it is possible that the addresses in
> this range had split_huge_pmd()/split_huge_pud(), and these functions
> do not update the compound page metadata.
> 
> The problem can be reproduced if something like this occurs:
> 
> 1. User faulted huge pages.
> 2. split_huge_pmd() was called for some reason
> 3. User has unmapped some sub-pages in the range
> 4. User tries to longterm pin the addresses.
> 
> The resulting pages[i] might end-up having pages which are not compound
> size page aligned.
> 
> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

[...]

>  		/*
>  		 * If we get a page from the CMA zone, since we are going to
>  		 * be pinning these entries, we might as well move them out
> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>  				}
>  			}
>  		}
> -
> -		i += step;
>  	}
>  

With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> 32k usecs when
pinning a 16G hugetlb file.

Splitting can only occur on THP right? If so, perhaps we could retain the @step increment
for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:

+               if (!is_transparent_hugepage(head) && PageCompound(page))
+                       i += (compound_nr(head) - (pages[i] - head));

Or making specific to hugetlbfs:

+               if (PageHuge(head))
+                       i += (compound_nr(head) - (pages[i] - head));
Pasha Tatashin Feb. 3, 2021, 2:51 p.m. UTC | #2
On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
> > When pages are isolated in check_and_migrate_movable_pages() we skip
> > compound number of pages at a time. However, as Jason noted, it is
> > not necessary correct that pages[i] corresponds to the pages that
> > we skipped. This is because it is possible that the addresses in
> > this range had split_huge_pmd()/split_huge_pud(), and these functions
> > do not update the compound page metadata.
> >
> > The problem can be reproduced if something like this occurs:
> >
> > 1. User faulted huge pages.
> > 2. split_huge_pmd() was called for some reason
> > 3. User has unmapped some sub-pages in the range
> > 4. User tries to longterm pin the addresses.
> >
> > The resulting pages[i] might end-up having pages which are not compound
> > size page aligned.
> >
> > Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
> > Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
>
> [...]
>
> >               /*
> >                * If we get a page from the CMA zone, since we are going to
> >                * be pinning these entries, we might as well move them out
> > @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> >                               }
> >                       }
> >               }
> > -
> > -             i += step;
> >       }
> >
>

Hi Joao,

> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> 32k usecs when
> pinning a 16G hugetlb file.

Estimate or you actually measured?

>

> Splitting can only occur on THP right? If so, perhaps we could retain the @step increment

Yes, I do not think we can split HugePage, only THP.

> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>
> +               if (!is_transparent_hugepage(head) && PageCompound(page))
> +                       i += (compound_nr(head) - (pages[i] - head));
>
> Or making specific to hugetlbfs:
>
> +               if (PageHuge(head))
> +                       i += (compound_nr(head) - (pages[i] - head));

Yes, this is reasonable optimization. I will submit a follow up patch
against linux-next.

Thank you,
Pasha
Jason Gunthorpe Feb. 3, 2021, 2:53 p.m. UTC | #3
On Wed, Feb 03, 2021 at 01:22:18PM +0000, Joao Martins wrote:

> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> 32k usecs when
> pinning a 16G hugetlb file.

Yes, but correctness demands it.

The solution is to track these pages as we discover them so we know if
a PMD/PUD points and can directly skip the duplicated work

> Splitting can only occur on THP right? If so, perhaps we could
> retain the @step increment for compound pages but when
> !is_transparent_hugepage(head) or just PageHuge(head) like:

Honestly I'd rather see it fixed properly which will give even bigger
performance gains - avoiding the entire rescan of the page list will
be a win

Jason
Joao Martins Feb. 3, 2021, 3:32 p.m. UTC | #4
On 2/3/21 2:51 PM, Pavel Tatashin wrote:
> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> When pages are isolated in check_and_migrate_movable_pages() we skip
>>> compound number of pages at a time. However, as Jason noted, it is
>>> not necessary correct that pages[i] corresponds to the pages that
>>> we skipped. This is because it is possible that the addresses in
>>> this range had split_huge_pmd()/split_huge_pud(), and these functions
>>> do not update the compound page metadata.
>>>
>>> The problem can be reproduced if something like this occurs:
>>>
>>> 1. User faulted huge pages.
>>> 2. split_huge_pmd() was called for some reason
>>> 3. User has unmapped some sub-pages in the range
>>> 4. User tries to longterm pin the addresses.
>>>
>>> The resulting pages[i] might end-up having pages which are not compound
>>> size page aligned.
>>>
>>> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
>>> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>
>> [...]
>>
>>>               /*
>>>                * If we get a page from the CMA zone, since we are going to
>>>                * be pinning these entries, we might as well move them out
>>> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>                               }
>>>                       }
>>>               }
>>> -
>>> -             i += step;
>>>       }
>>>
>>
> 
> Hi Joao,
> 
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
>> pinning a 16G hugetlb file.
> 
> Estimate or you actually measured?
> 
It's what I had measured before sending. The ~ is because there's error variance.

>>
> 
>> Splitting can only occur on THP right? If so, perhaps we could retain the @step increment
> 
> Yes, I do not think we can split HugePage, only THP.
> 
Right, that's my impression too.

>> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>>
>> +               if (!is_transparent_hugepage(head) && PageCompound(page))
>> +                       i += (compound_nr(head) - (pages[i] - head));
>>
>> Or making specific to hugetlbfs:
>>
>> +               if (PageHuge(head))
>> +                       i += (compound_nr(head) - (pages[i] - head));
> 
> Yes, this is reasonable optimization. I will submit a follow up patch
> against linux-next.
Joao Martins Feb. 3, 2021, 4:13 p.m. UTC | #5
On 2/3/21 2:53 PM, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 01:22:18PM +0000, Joao Martins wrote:
> 
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k usecs when
>> pinning a 16G hugetlb file.
> 
> Yes, but correctness demands it.
> 
Hence the quotes around the word regress.

But still, we are adding unnecessary overhead (previously nonexistent) to compound pages
even those where the issue doesn't exist (hugetlb).

> The solution is to track these pages as we discover them so we know if
> a PMD/PUD points and can directly skip the duplicated work
> 
That looks better. It would require moving check_and_migrate_movable_pages() elsewhere,
and/or possibly reworking the entire series?

>> Splitting can only occur on THP right? If so, perhaps we could
>> retain the @step increment for compound pages but when
>> !is_transparent_hugepage(head) or just PageHuge(head) like:
> 
> Honestly I'd rather see it fixed properly which will give even bigger
> performance gains - avoiding the entire rescan of the page list will
> be a win
> 

If check_and_migrate_movable_pages() is meant to migrate unpinned pages, then rather than
pinning+unpinning+moving, perhaps it would be called in __get_user_pages() place where we
are walking page tables and know if it's a PUD/PMD and can skip all the subpages and just
record and migrate those instead. Was that your thinking?

	Joao
Joao Martins Feb. 3, 2021, 4:46 p.m. UTC | #6
On 2/3/21 3:32 PM, Joao Martins wrote:
> On 2/3/21 2:51 PM, Pavel Tatashin wrote:
>> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> for compound pages but when !is_transparent_hugepage(head) or just PageHuge(head) like:
>>>
>>> +               if (!is_transparent_hugepage(head) && PageCompound(page))
>>> +                       i += (compound_nr(head) - (pages[i] - head));
>>>
>>> Or making specific to hugetlbfs:
>>>
>>> +               if (PageHuge(head))
>>> +                       i += (compound_nr(head) - (pages[i] - head));
>>
>> Yes, this is reasonable optimization. I will submit a follow up patch
>> against linux-next.

Realized it late, but the previous step was already broken. And I inherited its brokeness,
when copy-pasting the deleted chunk:

The @step should be capped at the remaining pages to iterate:

	i += min(nr_pages - i, compound_nr(head) - (pages[i] - head));

  Joao
Jason Gunthorpe Feb. 3, 2021, 6 p.m. UTC | #7
On Wed, Feb 03, 2021 at 04:13:21PM +0000, Joao Martins wrote:

> If check_and_migrate_movable_pages() is meant to migrate unpinned
> pages, then rather than pinning+unpinning+moving, perhaps it would
> be called in __get_user_pages() place where we are walking page
> tables and know if it's a PUD/PMD and can skip all the subpages and
> just record and migrate those instead. Was that your thinking?

I think a reasonable approach is to detect non-pinnable pages while
walking the VMAs, when found isolate them and thread them on a linked
list.

When the VMA walk is done you'll have a linked list of isolated pages
that need migration. So the check_and_migrate_movable_pages() gets
split into the top half being diffused in the VMA walk and the bottom
half still called after __get_user_pages() returns.

Jason
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 24f25b1e9103..16f10d5a9eb6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1556,26 +1556,23 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 					unsigned int gup_flags)
 {
 	unsigned long i;
-	unsigned long step;
 	bool drain_allow = true;
 	bool migrate_allow = true;
 	LIST_HEAD(cma_page_list);
 	long ret = nr_pages;
+	struct page *prev_head, *head;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
 		.gfp_mask = GFP_USER | __GFP_NOWARN,
 	};
 
 check_again:
-	for (i = 0; i < nr_pages;) {
-
-		struct page *head = compound_head(pages[i]);
-
-		/*
-		 * gup may start from a tail page. Advance step by the left
-		 * part.
-		 */
-		step = compound_nr(head) - (pages[i] - head);
+	prev_head = NULL;
+	for (i = 0; i < nr_pages; i++) {
+		head = compound_head(pages[i]);
+		if (head == prev_head)
+			continue;
+		prev_head = head;
 		/*
 		 * If we get a page from the CMA zone, since we are going to
 		 * be pinning these entries, we might as well move them out
@@ -1599,8 +1596,6 @@  static long check_and_migrate_cma_pages(struct mm_struct *mm,
 				}
 			}
 		}
-
-		i += step;
 	}
 
 	if (!list_empty(&cma_page_list)) {