diff mbox series

[v3,1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

Message ID 20240401232946.1837665-2-jthoughton@google.com (mailing list archive)
State New
Headers show
Series mm/kvm: Improve parallelism for access bit harvesting | expand

Commit Message

James Houghton April 1, 2024, 11:29 p.m. UTC
The bitmap is provided for secondary MMUs to use if they support it. For
test_young(), after it returns, the bitmap represents the pages that
were young in the interval [start, end). For clear_young, it represents
the pages that we wish the secondary MMU to clear the accessed/young bit
for.

If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
should be unchanged except that if young PTEs are found and the
architecture supports passing in a bitmap, instead of returning 1,
MMU_NOTIFIER_YOUNG_FAST is returned.

This allows MGLRU's look-around logic to work faster, resulting in a 4%
improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
to indicate to main mm that doing look-around is likely to be
beneficial.

If the secondary MMU doesn't support the bitmap, it must return
an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

[1]: https://lore.kernel.org/all/20230609005935.42390-1-yuzhao@google.com/

Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
 include/trace/events/kvm.h   | 13 +++--
 mm/mmu_notifier.c            | 20 +++++---
 virt/kvm/kvm_main.c          | 19 ++++++--
 4 files changed, 123 insertions(+), 22 deletions(-)

Comments

David Hildenbrand April 4, 2024, 6:52 p.m. UTC | #1
On 02.04.24 01:29, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
> 
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
> 
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
> 
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuzhao@google.com/
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>   include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
>   include/trace/events/kvm.h   | 13 +++--
>   mm/mmu_notifier.c            | 20 +++++---
>   virt/kvm/kvm_main.c          | 19 ++++++--
>   4 files changed, 123 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>   
>   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>   
> +#define MMU_NOTIFIER_YOUNG			(1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE	(1 << 1)

Especially this one really deserves some documentation :)

> +#define MMU_NOTIFIER_YOUNG_FAST			(1 << 2)

And that one as well.

Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).

> +
>   struct mmu_notifier_ops {
>   	/*
>   	 * Called either by mmu_notifier_unregister or when the mm is
> @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
>   	 * clear_young is a lightweight version of clear_flush_young. Like the
>   	 * latter, it is supposed to test-and-clear the young/accessed bitflag
>   	 * in the secondary pte, but it may omit flushing the secondary tlb.
> +	 *
> +	 * If @bitmap is given but is not supported, return
> +	 * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> +	 *
> +	 * If the walk is done "quickly" and there were young PTEs,
> +	 * MMU_NOTIFIER_YOUNG_FAST is returned.
>   	 */
>   	int (*clear_young)(struct mmu_notifier *subscription,
>   			   struct mm_struct *mm,
>   			   unsigned long start,
> -			   unsigned long end);
> +			   unsigned long end,
> +			   unsigned long *bitmap);
>   
>   	/*
>   	 * test_young is called to check the young/accessed bitflag in
>   	 * the secondary pte. This is used to know if the page is
>   	 * frequently used without actually clearing the flag or tearing
>   	 * down the secondary mapping on the page.
> +	 *
> +	 * If @bitmap is given but is not supported, return
> +	 * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> +	 *
> +	 * If the walk is done "quickly" and there were young PTEs,
> +	 * MMU_NOTIFIER_YOUNG_FAST is returned.
>   	 */
>   	int (*test_young)(struct mmu_notifier *subscription,
>   			  struct mm_struct *mm,
> -			  unsigned long address);
> +			  unsigned long start,
> +			  unsigned long end,
> +			  unsigned long *bitmap);

What does "quickly" mean (why not use "fast")? What are the semantics, I 
don't find any existing usage of that in this file.

Further, what is MMU_NOTIFIER_YOUNG you introduce used for?
James Houghton April 9, 2024, 6:31 p.m. UTC | #2
Ah, I didn't see this in my inbox, sorry David!

On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.04.24 01:29, James Houghton wrote:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> >   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG                   (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> Especially this one really deserves some documentation :)

Yes, will do. Something like

    MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).

>
> > +#define MMU_NOTIFIER_YOUNG_FAST                      (1 << 2)
>
> And that one as well.

Something like

   Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.

>
> Likely best to briefly document all of them, and how they are
> supposed to be used (return value for X).

Right. Will do.

>
> > +
> >   struct mmu_notifier_ops {
> >       /*
> >        * Called either by mmu_notifier_unregister or when the mm is
> > @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
> >        * clear_young is a lightweight version of clear_flush_young. Like the
> >        * latter, it is supposed to test-and-clear the young/accessed bitflag
> >        * in the secondary pte, but it may omit flushing the secondary tlb.
> > +      *
> > +      * If @bitmap is given but is not supported, return
> > +      * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +      *
> > +      * If the walk is done "quickly" and there were young PTEs,
> > +      * MMU_NOTIFIER_YOUNG_FAST is returned.
> >        */
> >       int (*clear_young)(struct mmu_notifier *subscription,
> >                          struct mm_struct *mm,
> >                          unsigned long start,
> > -                        unsigned long end);
> > +                        unsigned long end,
> > +                        unsigned long *bitmap);
> >
> >       /*
> >        * test_young is called to check the young/accessed bitflag in
> >        * the secondary pte. This is used to know if the page is
> >        * frequently used without actually clearing the flag or tearing
> >        * down the secondary mapping on the page.
> > +      *
> > +      * If @bitmap is given but is not supported, return
> > +      * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +      *
> > +      * If the walk is done "quickly" and there were young PTEs,
> > +      * MMU_NOTIFIER_YOUNG_FAST is returned.
> >        */
> >       int (*test_young)(struct mmu_notifier *subscription,
> >                         struct mm_struct *mm,
> > -                       unsigned long address);
> > +                       unsigned long start,
> > +                       unsigned long end,
> > +                       unsigned long *bitmap);
>
> What does "quickly" mean (why not use "fast")? What are the semantics, I
> don't find any existing usage of that in this file.

"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.

>
> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.

Thanks David!
David Hildenbrand April 9, 2024, 7:35 p.m. UTC | #3
On 09.04.24 20:31, James Houghton wrote:
> Ah, I didn't see this in my inbox, sorry David!

No worries :)

> 
> On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.04.24 01:29, James Houghton wrote:
>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>> index f349e08a9dfe..daaa9db625d3 100644
>>> --- a/include/linux/mmu_notifier.h
>>> +++ b/include/linux/mmu_notifier.h
>>> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>>>
>>>    #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>>>
>>> +#define MMU_NOTIFIER_YOUNG                   (1 << 0)
>>> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>>
>> Especially this one really deserves some documentation :)
> 
> Yes, will do. Something like
> 
>      MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
> bitmap either (1) does not accurately represent the age of the pages
> (in the case of test_young), or (2) was not able to be used to
> completely clear the age/access bit (in the case of clear_young).

Make sense. I do wonder what the expected reaction from the caller is :)

> 
>>
>>> +#define MMU_NOTIFIER_YOUNG_FAST                      (1 << 2)
>>
>> And that one as well.
> 
> Something like
> 
>     Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
> would have been supported for this address range.
> 
> The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
> is able to harvest the access bit "fast" (so for x86, locklessly, and
> for arm64, with the KVM MMU read lock), "fast" enough that using a
> bitmap to do look-around is probably a good idea.

Is that really the right way to communicate that ("would have been 
supported") -- wouldn't we want to sense support differently?

> 
>>
>> Likely best to briefly document all of them, and how they are
>> supposed to be used (return value for X).
> 
> Right. Will do.
> 
>>
>>> +
>>>    struct mmu_notifier_ops {
>>>        /*
>>>         * Called either by mmu_notifier_unregister or when the mm is
>>> @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
>>>         * clear_young is a lightweight version of clear_flush_young. Like the
>>>         * latter, it is supposed to test-and-clear the young/accessed bitflag
>>>         * in the secondary pte, but it may omit flushing the secondary tlb.
>>> +      *
>>> +      * If @bitmap is given but is not supported, return
>>> +      * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>>> +      *
>>> +      * If the walk is done "quickly" and there were young PTEs,
>>> +      * MMU_NOTIFIER_YOUNG_FAST is returned.
>>>         */
>>>        int (*clear_young)(struct mmu_notifier *subscription,
>>>                           struct mm_struct *mm,
>>>                           unsigned long start,
>>> -                        unsigned long end);
>>> +                        unsigned long end,
>>> +                        unsigned long *bitmap);
>>>
>>>        /*
>>>         * test_young is called to check the young/accessed bitflag in
>>>         * the secondary pte. This is used to know if the page is
>>>         * frequently used without actually clearing the flag or tearing
>>>         * down the secondary mapping on the page.
>>> +      *
>>> +      * If @bitmap is given but is not supported, return
>>> +      * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>>> +      *
>>> +      * If the walk is done "quickly" and there were young PTEs,
>>> +      * MMU_NOTIFIER_YOUNG_FAST is returned.
>>>         */
>>>        int (*test_young)(struct mmu_notifier *subscription,
>>>                          struct mm_struct *mm,
>>> -                       unsigned long address);
>>> +                       unsigned long start,
>>> +                       unsigned long end,
>>> +                       unsigned long *bitmap);
>>
>> What does "quickly" mean (why not use "fast")? What are the semantics, I
>> don't find any existing usage of that in this file.
> 
> "fast" means fast enough such that using a bitmap to scan adjacent
> pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
> this comment. Perhaps I should just rename it to
> MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
> beneficial" thing -- that's for MGLRU/etc. to decide really.

Yes!

> 
>>
>> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?
> 
> MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
> (1) didn't use a bitmap, and (2) the "fast" access bit harvesting
> wasn't possible. In this case we simply return 1, which is
> MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
> properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
> it will be 1.

Yes, that will clarify it!
James Houghton April 11, 2024, 12:35 a.m. UTC | #4
On Tue, Apr 9, 2024 at 12:35 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.04.24 20:31, James Houghton wrote:
> > Ah, I didn't see this in my inbox, sorry David!
>
> No worries :)
>
> >
> > On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 02.04.24 01:29, James Houghton wrote:
> >>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >>> index f349e08a9dfe..daaa9db625d3 100644
> >>> --- a/include/linux/mmu_notifier.h
> >>> +++ b/include/linux/mmu_notifier.h
> >>> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >>>
> >>>    #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >>>
> >>> +#define MMU_NOTIFIER_YOUNG                   (1 << 0)
> >>> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
> >>
> >> Especially this one really deserves some documentation :)
> >
> > Yes, will do. Something like
> >
> >      MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
> > bitmap either (1) does not accurately represent the age of the pages
> > (in the case of test_young), or (2) was not able to be used to
> > completely clear the age/access bit (in the case of clear_young).
>
> Make sense. I do wonder what the expected reaction from the caller is :)

In this series the caller doesn't actually care (matching what Yu had
in his v2[1]). test_spte_young() probably ought to return false if it
finds MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (and I'll do this in v4 if
no one objects), but it doesn't have to. The bitmap will never say
that a page is young when it was actually not, only the other way
around.

>
> >
> >>
> >>> +#define MMU_NOTIFIER_YOUNG_FAST                      (1 << 2)
> >>
> >> And that one as well.
> >
> > Something like
> >
> >     Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
> > would have been supported for this address range.
> >
> > The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
> > is able to harvest the access bit "fast" (so for x86, locklessly, and
> > for arm64, with the KVM MMU read lock), "fast" enough that using a
> > bitmap to do look-around is probably a good idea.
>
> Is that really the right way to communicate that ("would have been
> supported") -- wouldn't we want to sense support differently?

What I have now seems fine to me. It would be a little nicer to have a
way to query for bitmap support and make sure that the answer will not
be stale by the time we call the bitmap notifiers, but the complexity
to make that work seems unnecessary for dealing with such an uncommon
scenario.

Maybe the right thing to do is just to have KVM always return the same
answer. So instead of checking if the shadow root is allocated, we
could check something else (I'm not sure what exactly yet though...).

[1]: https://lore.kernel.org/kvmarm/20230526234435.662652-11-yuzhao@google.com/
David Matlack April 12, 2024, 6:45 p.m. UTC | #5
On 2024-04-01 11:29 PM, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
> 
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
> 
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
> 
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuzhao@google.com/
> 
> Suggested-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
>  include/trace/events/kvm.h   | 13 +++--
>  mm/mmu_notifier.c            | 20 +++++---
>  virt/kvm/kvm_main.c          | 19 ++++++--
>  4 files changed, 123 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> +#define MMU_NOTIFIER_YOUNG			(1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE	(1 << 1)

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
of test/clear_young(). I would vote to remove it.

> +#define MMU_NOTIFIER_YOUNG_FAST			(1 << 2)

Instead of MMU_NOTIFIER_YOUNG_FAST, how about
MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning
saying it recommends doing a look-around and passing in a bitmap?

That would avoid the whole "what does FAST really mean" confusion.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..ca4b1ef9dfc2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>  					struct mm_struct *mm,
>  					unsigned long start,
> -					unsigned long end)
> +					unsigned long end,
> +					unsigned long *bitmap)
>  {
>  	trace_kvm_age_hva(start, end);
>  
> +	/* We don't support bitmaps. Don't test or clear anything. */
> +	if (bitmap)
> +		return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Wouldn't it be a bug to get a bitmap here? The main MM is only suppost
to pass in a bitmap if the secondary MMU returns
MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.

Put another way, this check seems unneccessary.

> +
>  	/*
>  	 * Even though we do not flush TLB, this will still adversely
>  	 * affect performance on pre-Haswell Intel EPT, where there is
> @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>  
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long address)
> +				       unsigned long start,
> +				       unsigned long end,
> +				       unsigned long *bitmap)
>  {
> -	trace_kvm_test_age_hva(address);
> +	trace_kvm_test_age_hva(start, end);
> +
> +	/* We don't support bitmaps. Don't test or clear anything. */
> +	if (bitmap)
> +		return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Same thing here.
James Houghton April 19, 2024, 8:34 p.m. UTC | #6
On Fri, Apr 12, 2024 at 11:45 AM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > The bitmap is provided for secondary MMUs to use if they support it. For
> > test_young(), after it returns, the bitmap represents the pages that
> > were young in the interval [start, end). For clear_young, it represents
> > the pages that we wish the secondary MMU to clear the accessed/young bit
> > for.
> >
> > If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> > should be unchanged except that if young PTEs are found and the
> > architecture supports passing in a bitmap, instead of returning 1,
> > MMU_NOTIFIER_YOUNG_FAST is returned.
> >
> > This allows MGLRU's look-around logic to work faster, resulting in a 4%
> > improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> > to indicate to main mm that doing look-around is likely to be
> > beneficial.
> >
> > If the secondary MMU doesn't support the bitmap, it must return
> > an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> >
> > [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuzhao@google.com/
> >
> > Suggested-by: Yu Zhao <yuzhao@google.com>
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
> >  include/trace/events/kvm.h   | 13 +++--
> >  mm/mmu_notifier.c            | 20 +++++---
> >  virt/kvm/kvm_main.c          | 19 ++++++--
> >  4 files changed, 123 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG                   (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
> of test/clear_young(). I would vote to remove it.

Works for me.

>
> > +#define MMU_NOTIFIER_YOUNG_FAST                      (1 << 2)
>
> Instead of MMU_NOTIFIER_YOUNG_FAST, how about
> MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning
> saying it recommends doing a look-around and passing in a bitmap?
>
> That would avoid the whole "what does FAST really mean" confusion.

I think MMU_NOTIFIER_YOUNG_LOOK_AROUND is fine.

>
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fb49c2a60200..ca4b1ef9dfc2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> >                                       struct mm_struct *mm,
> >                                       unsigned long start,
> > -                                     unsigned long end)
> > +                                     unsigned long end,
> > +                                     unsigned long *bitmap)
> >  {
> >       trace_kvm_age_hva(start, end);
> >
> > +     /* We don't support bitmaps. Don't test or clear anything. */
> > +     if (bitmap)
> > +             return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
>
> Wouldn't it be a bug to get a bitmap here? The main MM is only suppost
> to pass in a bitmap if the secondary MMU returns
> MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.
>
> Put another way, this check seems unneccessary.
>
> > +
> >       /*
> >        * Even though we do not flush TLB, this will still adversely
> >        * affect performance on pre-Haswell Intel EPT, where there is
> > @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> >
> >  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> >                                      struct mm_struct *mm,
> > -                                    unsigned long address)
> > +                                    unsigned long start,
> > +                                    unsigned long end,
> > +                                    unsigned long *bitmap)
> >  {
> > -     trace_kvm_test_age_hva(address);
> > +     trace_kvm_test_age_hva(start, end);
> > +
> > +     /* We don't support bitmaps. Don't test or clear anything. */
> > +     if (bitmap)
> > +             return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
>
> Same thing here.

I will remove them, they are indeed unnecessary, as it is just dead code.
diff mbox series

Patch

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@  enum mmu_notifier_event {
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
 
+#define MMU_NOTIFIER_YOUNG			(1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE	(1 << 1)
+#define MMU_NOTIFIER_YOUNG_FAST			(1 << 2)
+
 struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@  struct mmu_notifier_ops {
 	 * clear_young is a lightweight version of clear_flush_young. Like the
 	 * latter, it is supposed to test-and-clear the young/accessed bitflag
 	 * in the secondary pte, but it may omit flushing the secondary tlb.
+	 *
+	 * If @bitmap is given but is not supported, return
+	 * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+	 *
+	 * If the walk is done "quickly" and there were young PTEs,
+	 * MMU_NOTIFIER_YOUNG_FAST is returned.
 	 */
 	int (*clear_young)(struct mmu_notifier *subscription,
 			   struct mm_struct *mm,
 			   unsigned long start,
-			   unsigned long end);
+			   unsigned long end,
+			   unsigned long *bitmap);
 
 	/*
 	 * test_young is called to check the young/accessed bitflag in
 	 * the secondary pte. This is used to know if the page is
 	 * frequently used without actually clearing the flag or tearing
 	 * down the secondary mapping on the page.
+	 *
+	 * If @bitmap is given but is not supported, return
+	 * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+	 *
+	 * If the walk is done "quickly" and there were young PTEs,
+	 * MMU_NOTIFIER_YOUNG_FAST is returned.
 	 */
 	int (*test_young)(struct mmu_notifier *subscription,
 			  struct mm_struct *mm,
-			  unsigned long address);
+			  unsigned long start,
+			  unsigned long end,
+			  unsigned long *bitmap);
 
 	/*
 	 * change_pte is called in cases that pte mapping to page is changed:
@@ -388,10 +407,11 @@  extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
 extern int __mmu_notifier_clear_young(struct mm_struct *mm,
-				      unsigned long start,
-				      unsigned long end);
+				      unsigned long start, unsigned long end,
+				      unsigned long *bitmap);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
-				     unsigned long address);
+				     unsigned long start, unsigned long end,
+				     unsigned long *bitmap);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -427,7 +447,25 @@  static inline int mmu_notifier_clear_young(struct mm_struct *mm,
 					   unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_clear_young(mm, start, end);
+		return __mmu_notifier_clear_young(mm, start, end, NULL);
+	return 0;
+}
+
+/*
+ * When @bitmap is not provided, clear the young bits in the secondary
+ * MMUs for all of the pages in the interval [start, end).
+ *
+ * If any subscribed secondary MMU does not support @bitmap, this function
+ * will return an integer containing MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ * Some work may have been done in the secondary MMU.
+ */
+static inline int mmu_notifier_clear_young_bitmap(struct mm_struct *mm,
+						  unsigned long start,
+						  unsigned long end,
+						  unsigned long *bitmap)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_young(mm, start, end, bitmap);
 	return 0;
 }
 
@@ -435,7 +473,25 @@  static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	if (mm_has_notifiers(mm))
-		return __mmu_notifier_test_young(mm, address);
+		return __mmu_notifier_test_young(mm, address, address + 1,
+						 NULL);
+	return 0;
+}
+
+/*
+ * When @bitmap is not provided, test the young bits in the secondary
+ * MMUs for all of the pages in the interval [start, end).
+ *
+ * If any subscribed secondary MMU does not support @bitmap, this function
+ * will return an integer containing MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ */
+static inline int mmu_notifier_test_young_bitmap(struct mm_struct *mm,
+						 unsigned long start,
+						 unsigned long end,
+						 unsigned long *bitmap)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_young(mm, start, end, bitmap);
 	return 0;
 }
 
@@ -644,12 +700,35 @@  static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	return 0;
+}
+
+static inline int mmu_notifier_clear_young_bitmap(struct mm_struct *mm,
+						  unsigned long start,
+						  unsigned long end,
+						  unsigned long *bitmap)
+{
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
 	return 0;
 }
 
+static inline int mmu_notifier_test_young_bitmap(struct mm_struct *mm,
+						 unsigned long start,
+						 unsigned long end,
+						 unsigned long *bitmap)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 011fba6b5552..e4ace8cfdbba 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -490,18 +490,21 @@  TRACE_EVENT(kvm_age_hva,
 );
 
 TRACE_EVENT(kvm_test_age_hva,
-	TP_PROTO(unsigned long hva),
-	TP_ARGS(hva),
+	TP_PROTO(unsigned long start, unsigned long end),
+	TP_ARGS(start, end),
 
 	TP_STRUCT__entry(
-		__field(	unsigned long,	hva		)
+		__field(	unsigned long,	start		)
+		__field(	unsigned long,	end		)
 	),
 
 	TP_fast_assign(
-		__entry->hva		= hva;
+		__entry->start		= start;
+		__entry->end		= end;
 	),
 
-	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
+	TP_printk("mmu notifier test age hva: %#016lx -- %#016lx",
+		  __entry->start, __entry->end)
 );
 
 #endif /* _TRACE_KVM_MAIN_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ec3b068cbbe6..e70c6222944c 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@  int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 
 int __mmu_notifier_clear_young(struct mm_struct *mm,
 			       unsigned long start,
-			       unsigned long end)
+			       unsigned long end,
+			       unsigned long *bitmap)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -395,7 +396,8 @@  int __mmu_notifier_clear_young(struct mm_struct *mm,
 				 srcu_read_lock_held(&srcu)) {
 		if (subscription->ops->clear_young)
 			young |= subscription->ops->clear_young(subscription,
-								mm, start, end);
+								mm, start, end,
+								bitmap);
 	}
 	srcu_read_unlock(&srcu, id);
 
@@ -403,7 +405,8 @@  int __mmu_notifier_clear_young(struct mm_struct *mm,
 }
 
 int __mmu_notifier_test_young(struct mm_struct *mm,
-			      unsigned long address)
+			      unsigned long start, unsigned long end,
+			      unsigned long *bitmap)
 {
 	struct mmu_notifier *subscription;
 	int young = 0, id;
@@ -413,9 +416,14 @@  int __mmu_notifier_test_young(struct mm_struct *mm,
 				 &mm->notifier_subscriptions->list, hlist,
 				 srcu_read_lock_held(&srcu)) {
 		if (subscription->ops->test_young) {
-			young = subscription->ops->test_young(subscription, mm,
-							      address);
-			if (young)
+			young |= subscription->ops->test_young(subscription, mm,
+							       start, end,
+							       bitmap);
+			if (young && !bitmap)
+				/*
+				 * We're not using a bitmap, so there is no
+				 * need to check any more secondary MMUs.
+				 */
 				break;
 		}
 	}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..ca4b1ef9dfc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -917,10 +917,15 @@  static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long start,
-					unsigned long end)
+					unsigned long end,
+					unsigned long *bitmap)
 {
 	trace_kvm_age_hva(start, end);
 
+	/* We don't support bitmaps. Don't test or clear anything. */
+	if (bitmap)
+		return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
+
 	/*
 	 * Even though we do not flush TLB, this will still adversely
 	 * affect performance on pre-Haswell Intel EPT, where there is
@@ -939,11 +944,17 @@  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long address)
+				       unsigned long start,
+				       unsigned long end,
+				       unsigned long *bitmap)
 {
-	trace_kvm_test_age_hva(address);
+	trace_kvm_test_age_hva(start, end);
+
+	/* We don't support bitmaps. Don't test or clear anything. */
+	if (bitmap)
+		return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
+	return kvm_handle_hva_range_no_flush(mn, start, end,
 					     kvm_test_age_gfn);
 }