[v1] mm/gup: adjust stale comment for RCU GUP-fast

Message ID	20220901072119.37588-1-david@redhat.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, David Hildenbrand <david@redhat.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Sasha Levin <sasha.levin@oracle.com>, "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>, Vlastimil Babka <vbabka@suse.cz>, Jerome Marchand <jmarchan@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>, Hugh Dickins <hughd@google.com>, Jason Gunthorpe <jgg@nvidia.com>, John Hubbard <jhubbard@nvidia.com>, Peter Xu <peterx@redhat.com>, Yang Shi <shy828301@gmail.com> Subject: [PATCH v1] mm/gup: adjust stale comment for RCU GUP-fast Date: Thu, 1 Sep 2022 09:21:19 +0200 Message-Id: <20220901072119.37588-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v1] mm/gup: adjust stale comment for RCU GUP-fast \| expand [v1] mm/gup: adjust stale comment for RCU GUP-fast

David Hildenbrand Sept. 1, 2022, 7:21 a.m. UTC

commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
PMDs") didn't remove all details about the THP split requirements for
RCU GUP-fast.

IPI broeadcasts on THP split are no longer required.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Kirill A . Shutemov Sept. 1, 2022, 2:37 p.m. UTC | #1

On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> PMDs") didn't remove all details about the THP split requirements for
> RCU GUP-fast.
> 
> IPI broeadcasts on THP split are no longer required.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jerome Marchand <jmarchan@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Jason Gunthorpe Sept. 1, 2022, 4:12 p.m. UTC | #2

On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> PMDs") didn't remove all details about the THP split requirements for
> RCU GUP-fast.
> 
> IPI broeadcasts on THP split are no longer required.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jerome Marchand <jmarchan@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

The comment a bit above seems to need touching to:

 * protected from page table pages being freed from under it, and should
 * block any THP splits.

What is the current situation for THP splits anyhow? Is there are
comment in the fast pmd code explaining it?

But this seems OK too

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

Peter Xu Sept. 1, 2022, 4:28 p.m. UTC | #3

On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> PMDs") didn't remove all details about the THP split requirements for
> RCU GUP-fast.
> 
> IPI broeadcasts on THP split are no longer required.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jerome Marchand <jmarchan@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/gup.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..cfe71f422787 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>   *
>   * Another way to achieve this is to batch up page table containing pages
>   * belonging to more than one mm_user, then rcu_sched a callback to free those
> - * pages. Disabling interrupts will allow the fast_gup walker to both block
> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> - * (which is a relatively rare event). The code below adopts this strategy.
> + * pages. Disabling interrupts will allow the fast_gup walker to block the
> + * rcu_sched callback.

This is the comment for fast-gup in general but not only for thp split.

I can understand that we don't need IPI for thp split, but isn't the IPIs
still needed for thp collapse (aka pmdp_collapse_flush)?

David Hildenbrand Sept. 1, 2022, 4:30 p.m. UTC | #4

On 01.09.22 18:12, Jason Gunthorpe wrote:
> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
>> PMDs") didn't remove all details about the THP split requirements for
>> RCU GUP-fast.
>>
>> IPI broeadcasts on THP split are no longer required.
>>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Sasha Levin <sasha.levin@oracle.com>
>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Jerome Marchand <jmarchan@redhat.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: John Hubbard <jhubbard@nvidia.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/gup.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> The comment a bit above seems to need touching to:
> 
>  * protected from page table pages being freed from under it, and should
>  * block any THP splits.

Ah right. Will drop it as well -- thanks!

> 
> What is the current situation for THP splits anyhow? Is there are
> comment in the fast pmd code explaining it?

Not aware of a comment, I think it just works naturally by always
re-routing references taken on tail pages to the head page refcount.

Before that, we had "Tail page refcounting", which -- as I understand --
was a mess.

ddc58f27f9eee9117219936f77e90ad5b2e00e96 contains some comments.

> 
> But this seems OK too
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason
>

David Hildenbrand Sept. 1, 2022, 4:34 p.m. UTC | #5

On 01.09.22 18:28, Peter Xu wrote:
> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
>> PMDs") didn't remove all details about the THP split requirements for
>> RCU GUP-fast.
>>
>> IPI broeadcasts on THP split are no longer required.
>>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Sasha Levin <sasha.levin@oracle.com>
>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Jerome Marchand <jmarchan@redhat.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>> Cc: John Hubbard <jhubbard@nvidia.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/gup.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5abdaf487460..cfe71f422787 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>>   *
>>   * Another way to achieve this is to batch up page table containing pages
>>   * belonging to more than one mm_user, then rcu_sched a callback to free those
>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
>> - * (which is a relatively rare event). The code below adopts this strategy.
>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
>> + * rcu_sched callback.
> 
> This is the comment for fast-gup in general but not only for thp split.

"an IPI that we broadcast for splitting THP" is about splitting THP.

> 
> I can understand that we don't need IPI for thp split, but isn't the IPIs
> still needed for thp collapse (aka pmdp_collapse_flush)?

That was, unfortunately, never documented -- and as discussed in the
other thread, arm64 doesn't do that IPI before collapse and might need
fixing. We'll most probably end up getting rid of that
(undocumented/forgotten) IPI requirement and fix it in GUP-fast by
re-rechecking if the PMD changed.

Thanks!

Peter Xu Sept. 1, 2022, 4:40 p.m. UTC | #6

On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote:
> On 01.09.22 18:28, Peter Xu wrote:
> > On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> >> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> >> PMDs") didn't remove all details about the THP split requirements for
> >> RCU GUP-fast.
> >>
> >> IPI broeadcasts on THP split are no longer required.
> >>
> >> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Cc: Sasha Levin <sasha.levin@oracle.com>
> >> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> Cc: Jerome Marchand <jmarchan@redhat.com>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >> Cc: John Hubbard <jhubbard@nvidia.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Yang Shi <shy828301@gmail.com>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>  mm/gup.c | 5 ++---
> >>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 5abdaf487460..cfe71f422787 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
> >>   *
> >>   * Another way to achieve this is to batch up page table containing pages
> >>   * belonging to more than one mm_user, then rcu_sched a callback to free those
> >> - * pages. Disabling interrupts will allow the fast_gup walker to both block
> >> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> >> - * (which is a relatively rare event). The code below adopts this strategy.
> >> + * pages. Disabling interrupts will allow the fast_gup walker to block the
> >> + * rcu_sched callback.
> > 
> > This is the comment for fast-gup in general but not only for thp split.
> 
> "an IPI that we broadcast for splitting THP" is about splitting THP.

Ah OK.  Shall we still keep some "IPI broadcast" information here if we're
modifying it?  Otherwise it gives a feeling that none needs the IPIs.

It can be dropped later if you want to rework the thp collapse side and
finally remove IPI dependency on fast-gup, but so far it seems to me it's
still needed.  Or just drop this patch until that rework happens?

> 
> > 
> > I can understand that we don't need IPI for thp split, but isn't the IPIs
> > still needed for thp collapse (aka pmdp_collapse_flush)?
> 
> That was, unfortunately, never documented -- and as discussed in the
> other thread, arm64 doesn't do that IPI before collapse and might need
> fixing. We'll most probably end up getting rid of that
> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
> re-rechecking if the PMD changed.

Yeah from an initial thought that looks valid to me.  It'll also allow
pmdp_collapse_flush() to be dropped too, am I right?

David Hildenbrand Sept. 1, 2022, 4:46 p.m. UTC | #7

On 01.09.22 18:40, Peter Xu wrote:
> On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote:
>> On 01.09.22 18:28, Peter Xu wrote:
>>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
>>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
>>>> PMDs") didn't remove all details about the THP split requirements for
>>>> RCU GUP-fast.
>>>>
>>>> IPI broeadcasts on THP split are no longer required.
>>>>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Sasha Levin <sasha.levin@oracle.com>
>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Jerome Marchand <jmarchan@redhat.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>  mm/gup.c | 5 ++---
>>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 5abdaf487460..cfe71f422787 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>>>>   *
>>>>   * Another way to achieve this is to batch up page table containing pages
>>>>   * belonging to more than one mm_user, then rcu_sched a callback to free those
>>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
>>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
>>>> - * (which is a relatively rare event). The code below adopts this strategy.
>>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
>>>> + * rcu_sched callback.
>>>
>>> This is the comment for fast-gup in general but not only for thp split.
>>
>> "an IPI that we broadcast for splitting THP" is about splitting THP.
> 
> Ah OK.  Shall we still keep some "IPI broadcast" information here if we're
> modifying it?  Otherwise it gives a feeling that none needs the IPIs.

I guess that's the end goal -- and we forgot about the PMD collapse case.

Are we aware of any other case that needs an IPI? I'd rather avoid
documenting something that's no longer true.

> 
> It can be dropped later if you want to rework the thp collapse side and
> finally remove IPI dependency on fast-gup, but so far it seems to me it's
> still needed.  Or just drop this patch until that rework happens?

The doc as is is obviously stale, why drop this patch?

We should see a fix for the THP collapse issue very soon I guess. Most
probably this patch will go upstream after that fix.

> 
>>
>>>
>>> I can understand that we don't need IPI for thp split, but isn't the IPIs
>>> still needed for thp collapse (aka pmdp_collapse_flush)?
>>
>> That was, unfortunately, never documented -- and as discussed in the
>> other thread, arm64 doesn't do that IPI before collapse and might need
>> fixing. We'll most probably end up getting rid of that
>> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
>> re-rechecking if the PMD changed.
> 
> Yeah from an initial thought that looks valid to me.  It'll also allow
> pmdp_collapse_flush() to be dropped too, am I right?

I think the magic about pmdp_collapse_flush() is not only the IPIs, but
that we don't perform an ordinary PMD flush but we logically flush "all
PTEs in that range".

Apparently, that's a difference on some architectures.

Peter Xu Sept. 1, 2022, 5:41 p.m. UTC | #8

On Thu, Sep 01, 2022 at 06:46:13PM +0200, David Hildenbrand wrote:
> On 01.09.22 18:40, Peter Xu wrote:
> > On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote:
> >> On 01.09.22 18:28, Peter Xu wrote:
> >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> >>>> PMDs") didn't remove all details about the THP split requirements for
> >>>> RCU GUP-fast.
> >>>>
> >>>> IPI broeadcasts on THP split are no longer required.
> >>>>
> >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>>> Cc: Sasha Levin <sasha.levin@oracle.com>
> >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >>>> Cc: Vlastimil Babka <vbabka@suse.cz>
> >>>> Cc: Jerome Marchand <jmarchan@redhat.com>
> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>>> Cc: Hugh Dickins <hughd@google.com>
> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>>> Cc: Peter Xu <peterx@redhat.com>
> >>>> Cc: Yang Shi <shy828301@gmail.com>
> >>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>> ---
> >>>>  mm/gup.c | 5 ++---
> >>>>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>> index 5abdaf487460..cfe71f422787 100644
> >>>> --- a/mm/gup.c
> >>>> +++ b/mm/gup.c
> >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
> >>>>   *
> >>>>   * Another way to achieve this is to batch up page table containing pages
> >>>>   * belonging to more than one mm_user, then rcu_sched a callback to free those
> >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
> >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> >>>> - * (which is a relatively rare event). The code below adopts this strategy.
> >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
> >>>> + * rcu_sched callback.
> >>>
> >>> This is the comment for fast-gup in general but not only for thp split.
> >>
> >> "an IPI that we broadcast for splitting THP" is about splitting THP.
> > 
> > Ah OK.  Shall we still keep some "IPI broadcast" information here if we're
> > modifying it?  Otherwise it gives a feeling that none needs the IPIs.
> 
> I guess that's the end goal -- and we forgot about the PMD collapse case.
> 
> Are we aware of any other case that needs an IPI? I'd rather avoid
> documenting something that's no longer true.

I'm not aware of any.

> 
> > 
> > It can be dropped later if you want to rework the thp collapse side and
> > finally remove IPI dependency on fast-gup, but so far it seems to me it's
> > still needed.  Or just drop this patch until that rework happens?
> 
> The doc as is is obviously stale, why drop this patch?
> 
> We should see a fix for the THP collapse issue very soon I guess. Most
> probably this patch will go upstream after that fix.

No objection to have this patch alone as the removal statement is only
about "thp split".  But IMHO this patch alone didn't really help a great
lot, especially if you plan to have more to come that is very relevant to
this, so it'll be clearer to put them together.  Your call.

> 
> > 
> >>
> >>>
> >>> I can understand that we don't need IPI for thp split, but isn't the IPIs
> >>> still needed for thp collapse (aka pmdp_collapse_flush)?
> >>
> >> That was, unfortunately, never documented -- and as discussed in the
> >> other thread, arm64 doesn't do that IPI before collapse and might need
> >> fixing. We'll most probably end up getting rid of that
> >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
> >> re-rechecking if the PMD changed.
> > 
> > Yeah from an initial thought that looks valid to me.  It'll also allow
> > pmdp_collapse_flush() to be dropped too, am I right?
> 
> I think the magic about pmdp_collapse_flush() is not only the IPIs, but
> that we don't perform an ordinary PMD flush but we logically flush "all
> PTEs in that range".

Yes there's a difference, good to learn that, thanks.

David Hildenbrand Sept. 1, 2022, 5:44 p.m. UTC | #9

>>> It can be dropped later if you want to rework the thp collapse side and
>>> finally remove IPI dependency on fast-gup, but so far it seems to me it's
>>> still needed.  Or just drop this patch until that rework happens?
>>
>> The doc as is is obviously stale, why drop this patch?
>>
>> We should see a fix for the THP collapse issue very soon I guess. Most
>> probably this patch will go upstream after that fix.
> 
> No objection to have this patch alone as the removal statement is only
> about "thp split".  But IMHO this patch alone didn't really help a great
> lot, especially if you plan to have more to come that is very relevant to
> this, so it'll be clearer to put them together.  Your call.

I can hold off the resend until the the fix is in place. Then I can add
to the description that we are not aware of remaining IPI dependencies,
and one undocumented case was broken and got fixed without the need for
IPIs.

Thanks!

Yang Shi Sept. 1, 2022, 5:50 p.m. UTC | #10

On Thu, Sep 1, 2022 at 9:46 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.09.22 18:40, Peter Xu wrote:
> > On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote:
> >> On 01.09.22 18:28, Peter Xu wrote:
> >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
> >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
> >>>> PMDs") didn't remove all details about the THP split requirements for
> >>>> RCU GUP-fast.
> >>>>
> >>>> IPI broeadcasts on THP split are no longer required.
> >>>>
> >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>>> Cc: Sasha Levin <sasha.levin@oracle.com>
> >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >>>> Cc: Vlastimil Babka <vbabka@suse.cz>
> >>>> Cc: Jerome Marchand <jmarchan@redhat.com>
> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>>> Cc: Hugh Dickins <hughd@google.com>
> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>>> Cc: Peter Xu <peterx@redhat.com>
> >>>> Cc: Yang Shi <shy828301@gmail.com>
> >>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>>> ---
> >>>>  mm/gup.c | 5 ++---
> >>>>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>> index 5abdaf487460..cfe71f422787 100644
> >>>> --- a/mm/gup.c
> >>>> +++ b/mm/gup.c
> >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
> >>>>   *
> >>>>   * Another way to achieve this is to batch up page table containing pages
> >>>>   * belonging to more than one mm_user, then rcu_sched a callback to free those
> >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
> >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> >>>> - * (which is a relatively rare event). The code below adopts this strategy.
> >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
> >>>> + * rcu_sched callback.
> >>>
> >>> This is the comment for fast-gup in general but not only for thp split.
> >>
> >> "an IPI that we broadcast for splitting THP" is about splitting THP.
> >
> > Ah OK.  Shall we still keep some "IPI broadcast" information here if we're
> > modifying it?  Otherwise it gives a feeling that none needs the IPIs.
>
> I guess that's the end goal -- and we forgot about the PMD collapse case.
>
> Are we aware of any other case that needs an IPI? I'd rather avoid
> documenting something that's no longer true.
>
> >
> > It can be dropped later if you want to rework the thp collapse side and
> > finally remove IPI dependency on fast-gup, but so far it seems to me it's
> > still needed.  Or just drop this patch until that rework happens?
>
> The doc as is is obviously stale, why drop this patch?
>
> We should see a fix for the THP collapse issue very soon I guess. Most
> probably this patch will go upstream after that fix.

I will be working on the fix.

>
> >
> >>
> >>>
> >>> I can understand that we don't need IPI for thp split, but isn't the IPIs
> >>> still needed for thp collapse (aka pmdp_collapse_flush)?
> >>
> >> That was, unfortunately, never documented -- and as discussed in the
> >> other thread, arm64 doesn't do that IPI before collapse and might need
> >> fixing. We'll most probably end up getting rid of that
> >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
> >> re-rechecking if the PMD changed.
> >
> > Yeah from an initial thought that looks valid to me.  It'll also allow
> > pmdp_collapse_flush() to be dropped too, am I right?
>
> I think the magic about pmdp_collapse_flush() is not only the IPIs, but
> that we don't perform an ordinary PMD flush but we logically flush "all
> PTEs in that range".

Yeah, because THP collapse does copy the data before clearing pte. If
we want to remove pmdp_collapse_flush() by just clearing pmd, we
should clear *AND* flush pte before copying the data IIRC.

>
> Apparently, that's a difference on some architectures.
>
>
> --
> Thanks,
>
> David / dhildenb
>

Peter Xu Sept. 1, 2022, 6:07 p.m. UTC | #11

On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
> Yeah, because THP collapse does copy the data before clearing pte. If
> we want to remove pmdp_collapse_flush() by just clearing pmd, we
> should clear *AND* flush pte before copying the data IIRC.

Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
still be working (with the pte level flushing there) but it should just
start to work for all archs, so potentially we could drop the arch-specific
pmdp_collapse_flush()s, mostly the ppc impl.

This also reminded me that the s390 version of pmdp_collapse_flush() is a
bit weird, since it doesn't even have the tlb flush there.  I feel like
it's broken but I can't really tell whether something I've overlooked.
Worth an eye on.

Yang Shi Sept. 1, 2022, 6:35 p.m. UTC | #12

On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
> > Yeah, because THP collapse does copy the data before clearing pte. If
> > we want to remove pmdp_collapse_flush() by just clearing pmd, we
> > should clear *AND* flush pte before copying the data IIRC.
>
> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
> still be working (with the pte level flushing there) but it should just
> start to work for all archs, so potentially we could drop the arch-specific
> pmdp_collapse_flush()s, mostly the ppc impl.

I'm don't know why powperpc needs to have its specific
pmdp_collapse_flush() in the first place, not only the mandatory IPI
broadcast, but also the specific implementation of pmd tlb flush. But
anyway the IPI broadcast could be removed at least IMO.

>
> This also reminded me that the s390 version of pmdp_collapse_flush() is a
> bit weird, since it doesn't even have the tlb flush there.  I feel like
> it's broken but I can't really tell whether something I've overlooked.
> Worth an eye on.

I don't know why. But if s390 doesn't flush tlb in
pmdp_collapse_flush(), then there may be data integrity problem since
the page is still writable when copying the data because pte is
cleared after data copying. Or s390 hardware does flush tlb
automatically?

>
> --
> Peter Xu
>

David Hildenbrand Sept. 2, 2022, 6:32 a.m. UTC | #13

On 01.09.22 20:35, Yang Shi wrote:
> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
>>> Yeah, because THP collapse does copy the data before clearing pte. If
>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we
>>> should clear *AND* flush pte before copying the data IIRC.
>>
>> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
>> still be working (with the pte level flushing there) but it should just
>> start to work for all archs, so potentially we could drop the arch-specific
>> pmdp_collapse_flush()s, mostly the ppc impl.
> 
> I'm don't know why powperpc needs to have its specific
> pmdp_collapse_flush() in the first place, not only the mandatory IPI
> broadcast, but also the specific implementation of pmd tlb flush. But
> anyway the IPI broadcast could be removed at least IMO.
> 

pmdp_collapse_flush() is overwritten on book3s only. It either translates
to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush().


radix__pmdp_collapse_flush() has a comment explaining the situation:


+       /*
+        * pmdp collapse_flush need to ensure that there are no parallel gup
+        * walk after this call. This is needed so that we can have stable
+        * page ref count when collapsing a page. We don't allow a collapse page
+        * if we have gup taken on the page. We can ensure that by sending IPI
+        * because gup walk happens with IRQ disabled.
+        */


The comment for hash__pmdp_collapse_flush() is a bit more involved:

	/*
	 * Wait for all pending hash_page to finish. This is needed
	 * in case of subpage collapse. When we collapse normal pages
	 * to hugepage, we first clear the pmd, then invalidate all
	 * the PTE entries. The assumption here is that any low level
	 * page fault will see a none pmd and take the slow path that
	 * will wait on mmap_lock. But we could very well be in a
	 * hash_page with local ptep pointer value. Such a hash page
	 * can result in adding new HPTE entries for normal subpages.
	 * That means we could be modifying the page content as we
	 * copy them to a huge page. So wait for parallel hash_page
	 * to finish before invalidating HPTE entries. We can do this
	 * by sending an IPI to all the cpus and executing a dummy
	 * function there.
	 */

I'm not sure if that implies that the IPI is needed for some other hash-magic.

Maybe Aneesh can clarify.

>>
>> This also reminded me that the s390 version of pmdp_collapse_flush() is a
>> bit weird, since it doesn't even have the tlb flush there.  I feel like
>> it's broken but I can't really tell whether something I've overlooked.
>> Worth an eye on.
> 
> I don't know why. But if s390 doesn't flush tlb in
> pmdp_collapse_flush(), then there may be data integrity problem since
> the page is still writable when copying the data because pte is
> cleared after data copying. Or s390 hardware does flush tlb
> automatically?

s390x does a pmdp_huge_get_and_clear().

pmdp_huge_get_and_clear() does an pmdp_xchg_direct().

pmdp_xchg_direct() does an pmdp_flush_direct().

pmdp_flush_direct() issues an IDTE, which is a TLB flush.


Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there:


/*
 * This is hard to understand. ptep_get_and_clear and ptep_clear_flush
 * both clear the TLB for the unmapped pte. The reason is that
 * ptep_get_and_clear is used in common code (e.g. change_pte_range)
 * to modify an active pte. The sequence is
 *   1) ptep_get_and_clear
 *   2) set_pte_at
 *   3) flush_tlb_range
 * On s390 the tlb needs to get flushed with the modification of the pte
 * if the pte is active. The only way how this can be implemented is to
 * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range
 * is a nop.
 */

Peter Xu Sept. 2, 2022, 1:53 p.m. UTC | #14

On Fri, Sep 02, 2022 at 08:32:42AM +0200, David Hildenbrand wrote:
> Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there:
> 
> 
> /*
>  * This is hard to understand. ptep_get_and_clear and ptep_clear_flush
>  * both clear the TLB for the unmapped pte. The reason is that
>  * ptep_get_and_clear is used in common code (e.g. change_pte_range)
>  * to modify an active pte. The sequence is
>  *   1) ptep_get_and_clear
>  *   2) set_pte_at
>  *   3) flush_tlb_range
>  * On s390 the tlb needs to get flushed with the modification of the pte
>  * if the pte is active. The only way how this can be implemented is to
>  * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range
>  * is a nop.
>  */

Ah, now I kind of see why s390 has its own world of pte operations.

But then we really should be noted on reworking the generic tlb code
because s390 is always special; people will think the generic tlb API is
for tlb but no-op for s390, e.g.  anyone optimizes generic tlb flushing
it'll probably never apply to s390.

Besides performance, hopefully there'll be no case where the tlb change
will be functional then it may affect s390 too.  But I don't see any since
as long as tlb was flushed earlier than the API then it seems always safe.
Just trickier.

Yang Shi Sept. 2, 2022, 3:37 p.m. UTC | #15

On Thu, Sep 1, 2022 at 11:32 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.09.22 20:35, Yang Shi wrote:
> > On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
> >>> Yeah, because THP collapse does copy the data before clearing pte. If
> >>> we want to remove pmdp_collapse_flush() by just clearing pmd, we
> >>> should clear *AND* flush pte before copying the data IIRC.
> >>
> >> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
> >> still be working (with the pte level flushing there) but it should just
> >> start to work for all archs, so potentially we could drop the arch-specific
> >> pmdp_collapse_flush()s, mostly the ppc impl.
> >
> > I'm don't know why powperpc needs to have its specific
> > pmdp_collapse_flush() in the first place, not only the mandatory IPI
> > broadcast, but also the specific implementation of pmd tlb flush. But
> > anyway the IPI broadcast could be removed at least IMO.
> >
>
> pmdp_collapse_flush() is overwritten on book3s only. It either translates
> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush().
>
>
> radix__pmdp_collapse_flush() has a comment explaining the situation:
>
>
> +       /*
> +        * pmdp collapse_flush need to ensure that there are no parallel gup
> +        * walk after this call. This is needed so that we can have stable
> +        * page ref count when collapsing a page. We don't allow a collapse page
> +        * if we have gup taken on the page. We can ensure that by sending IPI
> +        * because gup walk happens with IRQ disabled.
> +        */
>
>
> The comment for hash__pmdp_collapse_flush() is a bit more involved:
>
>         /*
>          * Wait for all pending hash_page to finish. This is needed
>          * in case of subpage collapse. When we collapse normal pages
>          * to hugepage, we first clear the pmd, then invalidate all
>          * the PTE entries. The assumption here is that any low level
>          * page fault will see a none pmd and take the slow path that
>          * will wait on mmap_lock. But we could very well be in a
>          * hash_page with local ptep pointer value. Such a hash page
>          * can result in adding new HPTE entries for normal subpages.
>          * That means we could be modifying the page content as we
>          * copy them to a huge page. So wait for parallel hash_page
>          * to finish before invalidating HPTE entries. We can do this
>          * by sending an IPI to all the cpus and executing a dummy
>          * function there.
>          */
>
> I'm not sure if that implies that the IPI is needed for some other hash-magic.

They do issue IPI broadcast to call a dummy function in order to
serialize against fast GUP, please see serialize_against_pte_lookup(),
and it does full memory barrier too.

I think the IPI broadcast could be removed once my fix is merged and
the common pmd clear and memory barrier could be consolidated, now it
is duplicated in both radix and hash.

>
> Maybe Aneesh can clarify.
>
> >>
> >> This also reminded me that the s390 version of pmdp_collapse_flush() is a
> >> bit weird, since it doesn't even have the tlb flush there.  I feel like
> >> it's broken but I can't really tell whether something I've overlooked.
> >> Worth an eye on.
> >
> > I don't know why. But if s390 doesn't flush tlb in
> > pmdp_collapse_flush(), then there may be data integrity problem since
> > the page is still writable when copying the data because pte is
> > cleared after data copying. Or s390 hardware does flush tlb
> > automatically?
>
> s390x does a pmdp_huge_get_and_clear().
>
> pmdp_huge_get_and_clear() does an pmdp_xchg_direct().
>
> pmdp_xchg_direct() does an pmdp_flush_direct().
>
> pmdp_flush_direct() issues an IDTE, which is a TLB flush.

Aha, thanks, I didn't look that deep... I stopped looking once I saw
pmdp_huge_get_and_clear(), I thought it just does clear...

>
>
> Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there:
>
>
> /*
>  * This is hard to understand. ptep_get_and_clear and ptep_clear_flush
>  * both clear the TLB for the unmapped pte. The reason is that
>  * ptep_get_and_clear is used in common code (e.g. change_pte_range)
>  * to modify an active pte. The sequence is
>  *   1) ptep_get_and_clear
>  *   2) set_pte_at
>  *   3) flush_tlb_range
>  * On s390 the tlb needs to get flushed with the modification of the pte
>  * if the pte is active. The only way how this can be implemented is to
>  * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range
>  * is a nop.
>  */
>
> --
> Thanks,
>
> David / dhildenb
>

Aneesh Kumar K.V Sept. 4, 2022, 4:49 p.m. UTC | #16

On 9/1/22 10:04 PM, David Hildenbrand wrote:
> On 01.09.22 18:28, Peter Xu wrote:
>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
>>> PMDs") didn't remove all details about the THP split requirements for
>>> RCU GUP-fast.
>>>
>>> IPI broeadcasts on THP split are no longer required.
>>>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Sasha Levin <sasha.levin@oracle.com>
>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Jerome Marchand <jmarchan@redhat.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  mm/gup.c | 5 ++---
>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 5abdaf487460..cfe71f422787 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>>>   *
>>>   * Another way to achieve this is to batch up page table containing pages
>>>   * belonging to more than one mm_user, then rcu_sched a callback to free those
>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
>>> - * (which is a relatively rare event). The code below adopts this strategy.
>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
>>> + * rcu_sched callback.
>>
>> This is the comment for fast-gup in general but not only for thp split.
> 
> "an IPI that we broadcast for splitting THP" is about splitting THP.
> 
>>
>> I can understand that we don't need IPI for thp split, but isn't the IPIs
>> still needed for thp collapse (aka pmdp_collapse_flush)?
> 
> That was, unfortunately, never documented -- and as discussed in the
> other thread, arm64 doesn't do that IPI before collapse and might need
> fixing. We'll most probably end up getting rid of that
> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
> re-rechecking if the PMD changed.
> 

Can you point to the other thread ?

-aneesh

Aneesh Kumar K.V Sept. 4, 2022, 4:52 p.m. UTC | #17

On 9/2/22 12:02 PM, David Hildenbrand wrote:
> On 01.09.22 20:35, Yang Shi wrote:
>> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
>>>> Yeah, because THP collapse does copy the data before clearing pte. If
>>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we
>>>> should clear *AND* flush pte before copying the data IIRC.
>>>
>>> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
>>> still be working (with the pte level flushing there) but it should just
>>> start to work for all archs, so potentially we could drop the arch-specific
>>> pmdp_collapse_flush()s, mostly the ppc impl.
>>
>> I'm don't know why powperpc needs to have its specific
>> pmdp_collapse_flush() in the first place, not only the mandatory IPI
>> broadcast, but also the specific implementation of pmd tlb flush. But
>> anyway the IPI broadcast could be removed at least IMO.
>>
> 
> pmdp_collapse_flush() is overwritten on book3s only. It either translates
> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush().
> 
> 
> radix__pmdp_collapse_flush() has a comment explaining the situation:
> 
> 
> +       /*
> +        * pmdp collapse_flush need to ensure that there are no parallel gup
> +        * walk after this call. This is needed so that we can have stable
> +        * page ref count when collapsing a page. We don't allow a collapse page
> +        * if we have gup taken on the page. We can ensure that by sending IPI
> +        * because gup walk happens with IRQ disabled.
> +        */
> 
> 
> The comment for hash__pmdp_collapse_flush() is a bit more involved:
> 
> 	/*
> 	 * Wait for all pending hash_page to finish. This is needed
> 	 * in case of subpage collapse. When we collapse normal pages
> 	 * to hugepage, we first clear the pmd, then invalidate all
> 	 * the PTE entries. The assumption here is that any low level
> 	 * page fault will see a none pmd and take the slow path that
> 	 * will wait on mmap_lock. But we could very well be in a
> 	 * hash_page with local ptep pointer value. Such a hash page
> 	 * can result in adding new HPTE entries for normal subpages.
> 	 * That means we could be modifying the page content as we
> 	 * copy them to a huge page. So wait for parallel hash_page
> 	 * to finish before invalidating HPTE entries. We can do this
> 	 * by sending an IPI to all the cpus and executing a dummy
> 	 * function there.
> 	 */
> 
> I'm not sure if that implies that the IPI is needed for some other hash-magic.
> 
> Maybe Aneesh can clarify.
> 

We still need the IPI for the hash. Another reason for architecture to override that
function is to help them use the right page size when flushing the TLB.


-aneesh

David Hildenbrand Sept. 5, 2022, 8:02 a.m. UTC | #18

On 04.09.22 18:49, Aneesh Kumar K V wrote:
> On 9/1/22 10:04 PM, David Hildenbrand wrote:
>> On 01.09.22 18:28, Peter Xu wrote:
>>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote:
>>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting
>>>> PMDs") didn't remove all details about the THP split requirements for
>>>> RCU GUP-fast.
>>>>
>>>> IPI broeadcasts on THP split are no longer required.
>>>>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: Sasha Levin <sasha.levin@oracle.com>
>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Jerome Marchand <jmarchan@redhat.com>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Jason Gunthorpe <jgg@nvidia.com>
>>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>   mm/gup.c | 5 ++---
>>>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 5abdaf487460..cfe71f422787 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>>>>    *
>>>>    * Another way to achieve this is to batch up page table containing pages
>>>>    * belonging to more than one mm_user, then rcu_sched a callback to free those
>>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block
>>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
>>>> - * (which is a relatively rare event). The code below adopts this strategy.
>>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the
>>>> + * rcu_sched callback.
>>>
>>> This is the comment for fast-gup in general but not only for thp split.
>>
>> "an IPI that we broadcast for splitting THP" is about splitting THP.
>>
>>>
>>> I can understand that we don't need IPI for thp split, but isn't the IPIs
>>> still needed for thp collapse (aka pmdp_collapse_flush)?
>>
>> That was, unfortunately, never documented -- and as discussed in the
>> other thread, arm64 doesn't do that IPI before collapse and might need
>> fixing. We'll most probably end up getting rid of that
>> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by
>> re-rechecking if the PMD changed.
>>
> 
> Can you point to the other thread ?

Sure

see

https://lkml.kernel.org/r/CAHbLzkqeDAnCdt3q4E2RZw64QEzVaO_pseR3VaoHUhB+rZFcZQ@mail.gmail.com

and a resulting patch from that discussion

https://lkml.kernel.org/r/20220901222707.477402-1-shy828301@gmail.com

David Hildenbrand Sept. 5, 2022, 8:38 a.m. UTC | #19

On 04.09.22 18:52, Aneesh Kumar K V wrote:
> On 9/2/22 12:02 PM, David Hildenbrand wrote:
>> On 01.09.22 20:35, Yang Shi wrote:
>>> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
>>>>> Yeah, because THP collapse does copy the data before clearing pte. If
>>>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we
>>>>> should clear *AND* flush pte before copying the data IIRC.
>>>>
>>>> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
>>>> still be working (with the pte level flushing there) but it should just
>>>> start to work for all archs, so potentially we could drop the arch-specific
>>>> pmdp_collapse_flush()s, mostly the ppc impl.
>>>
>>> I'm don't know why powperpc needs to have its specific
>>> pmdp_collapse_flush() in the first place, not only the mandatory IPI
>>> broadcast, but also the specific implementation of pmd tlb flush. But
>>> anyway the IPI broadcast could be removed at least IMO.
>>>
>>
>> pmdp_collapse_flush() is overwritten on book3s only. It either translates
>> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush().
>>
>>
>> radix__pmdp_collapse_flush() has a comment explaining the situation:
>>
>>
>> +       /*
>> +        * pmdp collapse_flush need to ensure that there are no parallel gup
>> +        * walk after this call. This is needed so that we can have stable
>> +        * page ref count when collapsing a page. We don't allow a collapse page
>> +        * if we have gup taken on the page. We can ensure that by sending IPI
>> +        * because gup walk happens with IRQ disabled.
>> +        */
>>
>>
>> The comment for hash__pmdp_collapse_flush() is a bit more involved:
>>
>> 	/*
>> 	 * Wait for all pending hash_page to finish. This is needed
>> 	 * in case of subpage collapse. When we collapse normal pages
>> 	 * to hugepage, we first clear the pmd, then invalidate all
>> 	 * the PTE entries. The assumption here is that any low level
>> 	 * page fault will see a none pmd and take the slow path that
>> 	 * will wait on mmap_lock. But we could very well be in a
>> 	 * hash_page with local ptep pointer value. Such a hash page
>> 	 * can result in adding new HPTE entries for normal subpages.
>> 	 * That means we could be modifying the page content as we
>> 	 * copy them to a huge page. So wait for parallel hash_page
>> 	 * to finish before invalidating HPTE entries. We can do this
>> 	 * by sending an IPI to all the cpus and executing a dummy
>> 	 * function there.
>> 	 */
>>
>> I'm not sure if that implies that the IPI is needed for some other hash-magic.
>>
>> Maybe Aneesh can clarify.
>>
> 
> We still need the IPI for the hash. Another reason for architecture to override that
> function is to help them use the right page size when flushing the TLB.

Thanks for clarifying. So the radix variant wouldn't need the IPI 
anymore, once GUP-fast is handled differently, correct?

Aneesh Kumar K.V Sept. 5, 2022, 8:41 a.m. UTC | #20

On 9/5/22 2:08 PM, David Hildenbrand wrote:
> On 04.09.22 18:52, Aneesh Kumar K V wrote:
>> On 9/2/22 12:02 PM, David Hildenbrand wrote:
>>> On 01.09.22 20:35, Yang Shi wrote:
>>>> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote:
>>>>>> Yeah, because THP collapse does copy the data before clearing pte. If
>>>>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we
>>>>>> should clear *AND* flush pte before copying the data IIRC.
>>>>>
>>>>> Yes tlb flush is still needed.  IIUC the generic pmdp_collapse_flush() will
>>>>> still be working (with the pte level flushing there) but it should just
>>>>> start to work for all archs, so potentially we could drop the arch-specific
>>>>> pmdp_collapse_flush()s, mostly the ppc impl.
>>>>
>>>> I'm don't know why powperpc needs to have its specific
>>>> pmdp_collapse_flush() in the first place, not only the mandatory IPI
>>>> broadcast, but also the specific implementation of pmd tlb flush. But
>>>> anyway the IPI broadcast could be removed at least IMO.
>>>>
>>>
>>> pmdp_collapse_flush() is overwritten on book3s only. It either translates
>>> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush().
>>>
>>>
>>> radix__pmdp_collapse_flush() has a comment explaining the situation:
>>>
>>>
>>> +       /*
>>> +        * pmdp collapse_flush need to ensure that there are no parallel gup
>>> +        * walk after this call. This is needed so that we can have stable
>>> +        * page ref count when collapsing a page. We don't allow a collapse page
>>> +        * if we have gup taken on the page. We can ensure that by sending IPI
>>> +        * because gup walk happens with IRQ disabled.
>>> +        */
>>>
>>>
>>> The comment for hash__pmdp_collapse_flush() is a bit more involved:
>>>
>>>     /*
>>>      * Wait for all pending hash_page to finish. This is needed
>>>      * in case of subpage collapse. When we collapse normal pages
>>>      * to hugepage, we first clear the pmd, then invalidate all
>>>      * the PTE entries. The assumption here is that any low level
>>>      * page fault will see a none pmd and take the slow path that
>>>      * will wait on mmap_lock. But we could very well be in a
>>>      * hash_page with local ptep pointer value. Such a hash page
>>>      * can result in adding new HPTE entries for normal subpages.
>>>      * That means we could be modifying the page content as we
>>>      * copy them to a huge page. So wait for parallel hash_page
>>>      * to finish before invalidating HPTE entries. We can do this
>>>      * by sending an IPI to all the cpus and executing a dummy
>>>      * function there.
>>>      */
>>>
>>> I'm not sure if that implies that the IPI is needed for some other hash-magic.
>>>
>>> Maybe Aneesh can clarify.
>>>
>>
>> We still need the IPI for the hash. Another reason for architecture to override that
>> function is to help them use the right page size when flushing the TLB.
> 
> Thanks for clarifying. So the radix variant wouldn't need the IPI anymore, once GUP-fast is handled differently, correct?
> 

yes. With this patch https://lkml.kernel.org/r/20220901222707.477402-1-shy828301@gmail.com we can remove the
serialize_against_pte_lookup(vma->vm_mm); in radix__pmdp_collapse_flush()

-aneesh

[v1] mm/gup: adjust stale comment for RCU GUP-fast

Commit Message

Comments

Patch