
[2/7] mm/khugepaged: stop swapping in page when VM_FAULT_RETRY occurs

Message ID 20220611084731.55155-3-linmiaohe@huawei.com (mailing list archive)
State New
Series A few cleanup patches for khugepaged

Commit Message

Miaohe Lin June 11, 2022, 8:47 a.m. UTC
When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
the swap entry will remain in the page table. This will result in a later
failure. So stop swapping in pages in this case to save CPU cycles.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/khugepaged.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

Comments

Zach O'Keefe June 15, 2022, 3:14 p.m. UTC | #1
On 11 Jun 16:47, Miaohe Lin wrote:
> When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
> the swap entry will remain in the page table. This will result in a later
> failure. So stop swapping in pages in this case to save CPU cycles.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/khugepaged.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 73570dfffcec..a8adb2d1e9c6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  		swapped_in++;
>  		ret = do_swap_page(&vmf);
>  
> -		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> +		/*
> +		 * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> +		 * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
> +		 * we do not retry here and swap entry will remain in pagetable
> +		 * resulting in later failure.
> +		 */
>  		if (ret & VM_FAULT_RETRY) {
>  			mmap_read_lock(mm);
> -			if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> -				/* vma is no longer available, don't continue to swapin */
> -				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -				return false;
> -			}
> -			/* check if the pmd is still valid */
> -			if (mm_find_pmd(mm, haddr) != pmd) {
> -				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -				return false;
> -			}
> +			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> +			return false;
>  		}
>  		if (ret & VM_FAULT_ERROR) {
>  			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -- 
> 2.23.0
> 
>

I've convinced myself this is correct, but don't understand how we got here.
AFAICT, we've always continued to fault in pages, and, as you mention, don't
retry ones that have failed with VM_FAULT_RETRY - so
__collapse_huge_page_isolate() should fail. I don't think (?) there is any
benefit to continuing to swap if we don't handle VM_FAULT_RETRY appropriately.

So, I think this change looks good from that perspective. I suppose the only
other question would be: should we handle the VM_FAULT_RETRY case? Maybe 1
additional attempt then fail? AFAIK, this mostly (?) happens when the page is
locked.  Maybe it's not worth the extra complexity though..
Yang Shi June 15, 2022, 5:49 p.m. UTC | #2
On Sat, Jun 11, 2022 at 1:47 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
> the swap entry will remain in the page table. This will result in a later
> failure. So stop swapping in pages in this case to save CPU cycles.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/khugepaged.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 73570dfffcec..a8adb2d1e9c6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>                 swapped_in++;
>                 ret = do_swap_page(&vmf);
>
> -               /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> +               /*
> +                * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> +                * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
> +                * we do not retry here and swap entry will remain in pagetable
> +                * resulting in later failure.

Yeah, it makes sense.

> +                */
>                 if (ret & VM_FAULT_RETRY) {
>                         mmap_read_lock(mm);

As a further optimization, you should not need to relock mmap_lock. You
may consider returning a different value, or passing in *locked and
setting it to false, then checking this value in the caller to skip the
unlock.
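
For illustration, a rough sketch of the *locked variant (hypothetical;
the extra parameter and the caller-side check are assumptions, not part
of this patch):

	/* Sketch: report the lock state to the caller instead of relocking. */
	static bool __collapse_huge_page_swapin(struct mm_struct *mm,
						struct vm_area_struct *vma,
						unsigned long haddr, pmd_t *pmd,
						int referenced, bool *locked)
	{
		...
		if (ret & VM_FAULT_RETRY) {
			/* do_swap_page() released mmap_lock; tell the caller. */
			*locked = false;
			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
			return false;
		}
		...
	}

	/* In the caller, skip the unlock when the lock was already dropped: */
	bool locked = true;

	if (!__collapse_huge_page_swapin(mm, vma, haddr, pmd, referenced, &locked)) {
		if (locked)
			mmap_read_unlock(mm);
		goto out_nolock;	/* hypothetical label */
	}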

> -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> -                               /* vma is no longer available, don't continue to swapin */
> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -                               return false;
> -                       }
> -                       /* check if the pmd is still valid */
> -                       if (mm_find_pmd(mm, haddr) != pmd) {
> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -                               return false;
> -                       }
> +                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> +                       return false;
>                 }
>                 if (ret & VM_FAULT_ERROR) {
>                         trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);

And I think "swapped_in++" needs to be moved after error handling.
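That is, something along these lines (sketch; only the placement of the
increment changes):

		ret = do_swap_page(&vmf);

		if (ret & VM_FAULT_RETRY) {
			...
			return false;
		}
		if (ret & VM_FAULT_ERROR) {
			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
			return false;
		}
		/* Count only pages that were actually swapped in. */
		swapped_in++;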

> --
> 2.23.0
>
>
Yang Shi June 15, 2022, 5:51 p.m. UTC | #3
On Wed, Jun 15, 2022 at 8:14 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On 11 Jun 16:47, Miaohe Lin wrote:
> > When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
> > the swap entry will remain in the page table. This will result in a later
> > failure. So stop swapping in pages in this case to save CPU cycles.
> >
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/khugepaged.c | 19 ++++++++-----------
> >  1 file changed, 8 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 73570dfffcec..a8adb2d1e9c6 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >               swapped_in++;
> >               ret = do_swap_page(&vmf);
> >
> > -             /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> > +             /*
> > +              * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> > +              * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
> > +              * we do not retry here and swap entry will remain in pagetable
> > +              * resulting in later failure.
> > +              */
> >               if (ret & VM_FAULT_RETRY) {
> >                       mmap_read_lock(mm);
> > -                     if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> > -                             /* vma is no longer available, don't continue to swapin */
> > -                             trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > -                             return false;
> > -                     }
> > -                     /* check if the pmd is still valid */
> > -                     if (mm_find_pmd(mm, haddr) != pmd) {
> > -                             trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > -                             return false;
> > -                     }
> > +                     trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > +                     return false;
> >               }
> >               if (ret & VM_FAULT_ERROR) {
> >                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > --
> > 2.23.0
> >
> >
>
> I've convinced myself this is correct, but don't understand how we got here.
> AFAICT, we've always continued to fault in pages, and, as you mention, don't
> retry ones that have failed with VM_FAULT_RETRY - so
> __collapse_huge_page_isolate() should fail. I don't think (?) there is any
> benefit to continuing to swap if we don't handle VM_FAULT_RETRY appropriately.
>
> So, I think this change looks good from that perspective. I suppose the only
> other question would be: should we handle the VM_FAULT_RETRY case? Maybe 1
> additional attempt then fail? AFAIK, this mostly (?) happens when the page is
> locked.  Maybe it's not worth the extra complexity though..

It should be unnecessary for khugepaged IMHO since it will scan all
the valid mms periodically, so it will come back eventually.

>
Miaohe Lin June 16, 2022, 6:08 a.m. UTC | #4
On 2022/6/16 1:51, Yang Shi wrote:
> On Wed, Jun 15, 2022 at 8:14 AM Zach O'Keefe <zokeefe@google.com> wrote:
>>
>> On 11 Jun 16:47, Miaohe Lin wrote:
>>> When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
>>> the swap entry will remain in the page table. This will result in a later
>>> failure. So stop swapping in pages in this case to save CPU cycles.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/khugepaged.c | 19 ++++++++-----------
>>>  1 file changed, 8 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 73570dfffcec..a8adb2d1e9c6 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>>>               swapped_in++;
>>>               ret = do_swap_page(&vmf);
>>>
>>> -             /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
>>> +             /*
>>> +              * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
>>> +              * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
>>> +              * we do not retry here and swap entry will remain in pagetable
>>> +              * resulting in later failure.
>>> +              */
>>>               if (ret & VM_FAULT_RETRY) {
>>>                       mmap_read_lock(mm);
>>> -                     if (hugepage_vma_revalidate(mm, haddr, &vma)) {
>>> -                             /* vma is no longer available, don't continue to swapin */
>>> -                             trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>>> -                             return false;
>>> -                     }
>>> -                     /* check if the pmd is still valid */
>>> -                     if (mm_find_pmd(mm, haddr) != pmd) {
>>> -                             trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>>> -                             return false;
>>> -                     }
>>> +                     trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>>> +                     return false;
>>>               }
>>>               if (ret & VM_FAULT_ERROR) {
>>>                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>>> --
>>> 2.23.0
>>>
>>>
>>
>> I've convinced myself this is correct, but don't understand how we got here.
>> AFAICT, we've always continued to fault in pages, and, as you mention, don't
>> retry ones that have failed with VM_FAULT_RETRY - so
>> __collapse_huge_page_isolate() should fail. I don't think (?) there is any
>> benefit to continuing to swap if we don't handle VM_FAULT_RETRY appropriately.
>>
>> So, I think this change looks good from that perspective. I suppose the only
>> other question would be: should we handle the VM_FAULT_RETRY case? Maybe 1
>> additional attempt then fail? AFAIK, this mostly (?) happens when the page is
>> locked.  Maybe it's not worth the extra complexity though..
> 
> It should be unnecessary for khugepaged IMHO since it will scan all
> the valid mms periodically, so it will come back eventually.

I tend to agree with Yang. Khugepaged will come back eventually so it's not
worth the extra complexity.

Thanks both!

> 
>>
> .
>
Miaohe Lin June 16, 2022, 6:40 a.m. UTC | #5
On 2022/6/16 1:49, Yang Shi wrote:
> On Sat, Jun 11, 2022 at 1:47 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
>> the swap entry will remain in the page table. This will result in a later
>> failure. So stop swapping in pages in this case to save CPU cycles.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/khugepaged.c | 19 ++++++++-----------
>>  1 file changed, 8 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 73570dfffcec..a8adb2d1e9c6 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>>                 swapped_in++;
>>                 ret = do_swap_page(&vmf);
>>
>> -               /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
>> +               /*
>> +                * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
>> +                * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
>> +                * we do not retry here and swap entry will remain in pagetable
>> +                * resulting in later failure.
> 
> Yeah, it makes sense.
> 
>> +                */
>>                 if (ret & VM_FAULT_RETRY) {
>>                         mmap_read_lock(mm);
> 
> As a further optimization, you should not need to relock mmap_lock. You
> may consider returning a different value, or passing in *locked and
> setting it to false, then checking this value in the caller to skip the
> unlock.

Could we just keep the mmap_lock unlocked when __collapse_huge_page_swapin() fails, since the caller
always does mmap_read_unlock when __collapse_huge_page_swapin() returns false, and add some comments
about this behavior? This looks like a simpler way to me.
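
As a rough sketch of that convention (assumed, not the final patch):
every failure path in __collapse_huge_page_swapin() would leave
mmap_lock released, so a false return always means "unlocked" and the
caller skips mmap_read_unlock():

	if (ret & VM_FAULT_RETRY) {
		/* mmap_lock was released by do_swap_page(); don't retake it. */
		trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
		return false;
	}
	if (ret & VM_FAULT_ERROR) {
		/* Release here too, so "return false" always implies unlocked. */
		mmap_read_unlock(mm);
		trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
		return false;
	}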

> 
>> -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
>> -                               /* vma is no longer available, don't continue to swapin */
>> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>> -                               return false;
>> -                       }
>> -                       /* check if the pmd is still valid */
>> -                       if (mm_find_pmd(mm, haddr) != pmd) {
>> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>> -                               return false;
>> -                       }
>> +                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>> +                       return false;
>>                 }
>>                 if (ret & VM_FAULT_ERROR) {
>>                         trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> 
> And I think "swapped_in++" needs to be moved after error handling.

Do you mean to do "swapped_in++" only after pages are swapped in successfully?

Thanks!

> 
>> --
>> 2.23.0
>>
>>
> .
>
Yang Shi June 16, 2022, 3:46 p.m. UTC | #6
On Wed, Jun 15, 2022 at 11:40 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/6/16 1:49, Yang Shi wrote:
> > On Sat, Jun 11, 2022 at 1:47 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> When do_swap_page returns VM_FAULT_RETRY, we do not retry here, and thus
> >> the swap entry will remain in the page table. This will result in a later
> >> failure. So stop swapping in pages in this case to save CPU cycles.
> >>
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> ---
> >>  mm/khugepaged.c | 19 ++++++++-----------
> >>  1 file changed, 8 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 73570dfffcec..a8adb2d1e9c6 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1003,19 +1003,16 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >>                 swapped_in++;
> >>                 ret = do_swap_page(&vmf);
> >>
> >> -               /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
> >> +               /*
> >> +                * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> >> +                * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
> >> +                * we do not retry here and swap entry will remain in pagetable
> >> +                * resulting in later failure.
> >
> > Yeah, it makes sense.
> >
> >> +                */
> >>                 if (ret & VM_FAULT_RETRY) {
> >>                         mmap_read_lock(mm);
> >
> > As a further optimization, you should not need to relock mmap_lock. You
> > may consider returning a different value, or passing in *locked and
> > setting it to false, then checking this value in the caller to skip the
> > unlock.
>
> Could we just keep the mmap_lock unlocked when __collapse_huge_page_swapin() fails, since the caller
> always does mmap_read_unlock when __collapse_huge_page_swapin() returns false, and add some comments
> about this behavior? This looks like a simpler way to me.

Yeah, that sounds better.

>
> >
> >> -                       if (hugepage_vma_revalidate(mm, haddr, &vma)) {
> >> -                               /* vma is no longer available, don't continue to swapin */
> >> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> >> -                               return false;
> >> -                       }
> >> -                       /* check if the pmd is still valid */
> >> -                       if (mm_find_pmd(mm, haddr) != pmd) {
> >> -                               trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> >> -                               return false;
> >> -                       }
> >> +                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> >> +                       return false;
> >>                 }
> >>                 if (ret & VM_FAULT_ERROR) {
> >>                         trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> >
> > And I think "swapped_in++" needs to be moved after error handling.
>
> Do you mean to do "swapped_in++" only after pages are swapped in successfully?

Yes.

>
> Thanks!
>
> >
> >> --
> >> 2.23.0
> >>
> >>
> > .
> >
>

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 73570dfffcec..a8adb2d1e9c6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1003,19 +1003,16 @@  static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		swapped_in++;
 		ret = do_swap_page(&vmf);
 
-		/* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */
+		/*
+		 * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
+		 * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because
+		 * we do not retry here and swap entry will remain in pagetable
+		 * resulting in later failure.
+		 */
 		if (ret & VM_FAULT_RETRY) {
 			mmap_read_lock(mm);
-			if (hugepage_vma_revalidate(mm, haddr, &vma)) {
-				/* vma is no longer available, don't continue to swapin */
-				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-				return false;
-			}
-			/* check if the pmd is still valid */
-			if (mm_find_pmd(mm, haddr) != pmd) {
-				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
-				return false;
-			}
+			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
+			return false;
 		}
 		if (ret & VM_FAULT_ERROR) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);