Message ID | 20220901072119.37588-1-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] mm/gup: adjust stale comment for RCU GUP-fast | expand |
On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > PMDs") didn't remove all details about the THP split requirements for > RCU GUP-fast. > > IPI broeadcasts on THP split are no longer required. > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Sasha Levin <sasha.levin@oracle.com> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Jerome Marchand <jmarchan@redhat.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Yang Shi <shy828301@gmail.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > PMDs") didn't remove all details about the THP split requirements for > RCU GUP-fast. > > IPI broeadcasts on THP split are no longer required. > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Sasha Levin <sasha.levin@oracle.com> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Jerome Marchand <jmarchan@redhat.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Yang Shi <shy828301@gmail.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > mm/gup.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) The comment a bit above seems to need touching to: * protected from page table pages being freed from under it, and should * block any THP splits. What is the current situation for THP splits anyhow? Is there are comment in the fast pmd code explaining it? But this seems OK too Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Jason
On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > PMDs") didn't remove all details about the THP split requirements for > RCU GUP-fast. > > IPI broeadcasts on THP split are no longer required. > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Sasha Levin <sasha.levin@oracle.com> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Jerome Marchand <jmarchan@redhat.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Yang Shi <shy828301@gmail.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > mm/gup.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 5abdaf487460..cfe71f422787 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); > * > * Another way to achieve this is to batch up page table containing pages > * belonging to more than one mm_user, then rcu_sched a callback to free those > - * pages. Disabling interrupts will allow the fast_gup walker to both block > - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs > - * (which is a relatively rare event). The code below adopts this strategy. > + * pages. Disabling interrupts will allow the fast_gup walker to block the > + * rcu_sched callback. This is the comment for fast-gup in general but not only for thp split. I can understand that we don't need IPI for thp split, but isn't the IPIs still needed for thp collapse (aka pmdp_collapse_flush)?
On 01.09.22 18:12, Jason Gunthorpe wrote: > On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: >> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting >> PMDs") didn't remove all details about the THP split requirements for >> RCU GUP-fast. >> >> IPI broeadcasts on THP split are no longer required. >> >> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> Cc: Sasha Levin <sasha.levin@oracle.com> >> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Jerome Marchand <jmarchan@redhat.com> >> Cc: Andrea Arcangeli <aarcange@redhat.com> >> Cc: Hugh Dickins <hughd@google.com> >> Cc: Jason Gunthorpe <jgg@nvidia.com> >> Cc: John Hubbard <jhubbard@nvidia.com> >> Cc: Peter Xu <peterx@redhat.com> >> Cc: Yang Shi <shy828301@gmail.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> mm/gup.c | 5 ++--- >> 1 file changed, 2 insertions(+), 3 deletions(-) > > The comment a bit above seems to need touching to: > > * protected from page table pages being freed from under it, and should > * block any THP splits. Ah right. Will drop it as well -- thanks! > > What is the current situation for THP splits anyhow? Is there are > comment in the fast pmd code explaining it? Not aware of a comment, I think it just works naturally by always re-routing references taken on tail pages to the head page refcount. Before that, we had "Tail page refcounting", which -- as I understand -- was a mess. ddc58f27f9eee9117219936f77e90ad5b2e00e96 contains some comments. > > But this seems OK too > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > > Jason >
On 01.09.22 18:28, Peter Xu wrote: > On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: >> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting >> PMDs") didn't remove all details about the THP split requirements for >> RCU GUP-fast. >> >> IPI broeadcasts on THP split are no longer required. >> >> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> Cc: Sasha Levin <sasha.levin@oracle.com> >> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Jerome Marchand <jmarchan@redhat.com> >> Cc: Andrea Arcangeli <aarcange@redhat.com> >> Cc: Hugh Dickins <hughd@google.com> >> Cc: Jason Gunthorpe <jgg@nvidia.com> >> Cc: John Hubbard <jhubbard@nvidia.com> >> Cc: Peter Xu <peterx@redhat.com> >> Cc: Yang Shi <shy828301@gmail.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> mm/gup.c | 5 ++--- >> 1 file changed, 2 insertions(+), 3 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index 5abdaf487460..cfe71f422787 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); >> * >> * Another way to achieve this is to batch up page table containing pages >> * belonging to more than one mm_user, then rcu_sched a callback to free those >> - * pages. Disabling interrupts will allow the fast_gup walker to both block >> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs >> - * (which is a relatively rare event). The code below adopts this strategy. >> + * pages. Disabling interrupts will allow the fast_gup walker to block the >> + * rcu_sched callback. > > This is the comment for fast-gup in general but not only for thp split. "an IPI that we broadcast for splitting THP" is about splitting THP. > > I can understand that we don't need IPI for thp split, but isn't the IPIs > still needed for thp collapse (aka pmdp_collapse_flush)? That was, unfortunately, never documented -- and as discussed in the other thread, arm64 doesn't do that IPI before collapse and might need fixing. We'll most probably end up getting rid of that (undocumented/forgotten) IPI requirement and fix it in GUP-fast by re-rechecking if the PMD changed. Thanks!
On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote: > On 01.09.22 18:28, Peter Xu wrote: > > On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > >> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > >> PMDs") didn't remove all details about the THP split requirements for > >> RCU GUP-fast. > >> > >> IPI broeadcasts on THP split are no longer required. > >> > >> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >> Cc: Sasha Levin <sasha.levin@oracle.com> > >> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > >> Cc: Vlastimil Babka <vbabka@suse.cz> > >> Cc: Jerome Marchand <jmarchan@redhat.com> > >> Cc: Andrea Arcangeli <aarcange@redhat.com> > >> Cc: Hugh Dickins <hughd@google.com> > >> Cc: Jason Gunthorpe <jgg@nvidia.com> > >> Cc: John Hubbard <jhubbard@nvidia.com> > >> Cc: Peter Xu <peterx@redhat.com> > >> Cc: Yang Shi <shy828301@gmail.com> > >> Signed-off-by: David Hildenbrand <david@redhat.com> > >> --- > >> mm/gup.c | 5 ++--- > >> 1 file changed, 2 insertions(+), 3 deletions(-) > >> > >> diff --git a/mm/gup.c b/mm/gup.c > >> index 5abdaf487460..cfe71f422787 100644 > >> --- a/mm/gup.c > >> +++ b/mm/gup.c > >> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); > >> * > >> * Another way to achieve this is to batch up page table containing pages > >> * belonging to more than one mm_user, then rcu_sched a callback to free those > >> - * pages. Disabling interrupts will allow the fast_gup walker to both block > >> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs > >> - * (which is a relatively rare event). The code below adopts this strategy. > >> + * pages. Disabling interrupts will allow the fast_gup walker to block the > >> + * rcu_sched callback. > > > > This is the comment for fast-gup in general but not only for thp split. > > "an IPI that we broadcast for splitting THP" is about splitting THP. Ah OK. Shall we still keep some "IPI broadcast" information here if we're modifying it? Otherwise it gives a feeling that none needs the IPIs. It can be dropped later if you want to rework the thp collapse side and finally remove IPI dependency on fast-gup, but so far it seems to me it's still needed. Or just drop this patch until that rework happens? > > > > > I can understand that we don't need IPI for thp split, but isn't the IPIs > > still needed for thp collapse (aka pmdp_collapse_flush)? > > That was, unfortunately, never documented -- and as discussed in the > other thread, arm64 doesn't do that IPI before collapse and might need > fixing. We'll most probably end up getting rid of that > (undocumented/forgotten) IPI requirement and fix it in GUP-fast by > re-rechecking if the PMD changed. Yeah from an initial thought that looks valid to me. It'll also allow pmdp_collapse_flush() to be dropped too, am I right?
On 01.09.22 18:40, Peter Xu wrote: > On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote: >> On 01.09.22 18:28, Peter Xu wrote: >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting >>>> PMDs") didn't remove all details about the THP split requirements for >>>> RCU GUP-fast. >>>> >>>> IPI broeadcasts on THP split are no longer required. >>>> >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>>> Cc: Sasha Levin <sasha.levin@oracle.com> >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >>>> Cc: Vlastimil Babka <vbabka@suse.cz> >>>> Cc: Jerome Marchand <jmarchan@redhat.com> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com> >>>> Cc: John Hubbard <jhubbard@nvidia.com> >>>> Cc: Peter Xu <peterx@redhat.com> >>>> Cc: Yang Shi <shy828301@gmail.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> --- >>>> mm/gup.c | 5 ++--- >>>> 1 file changed, 2 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/mm/gup.c b/mm/gup.c >>>> index 5abdaf487460..cfe71f422787 100644 >>>> --- a/mm/gup.c >>>> +++ b/mm/gup.c >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); >>>> * >>>> * Another way to achieve this is to batch up page table containing pages >>>> * belonging to more than one mm_user, then rcu_sched a callback to free those >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs >>>> - * (which is a relatively rare event). The code below adopts this strategy. >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the >>>> + * rcu_sched callback. >>> >>> This is the comment for fast-gup in general but not only for thp split. >> >> "an IPI that we broadcast for splitting THP" is about splitting THP. > > Ah OK. Shall we still keep some "IPI broadcast" information here if we're > modifying it? Otherwise it gives a feeling that none needs the IPIs. I guess that's the end goal -- and we forgot about the PMD collapse case. Are we aware of any other case that needs an IPI? I'd rather avoid documenting something that's no longer true. > > It can be dropped later if you want to rework the thp collapse side and > finally remove IPI dependency on fast-gup, but so far it seems to me it's > still needed. Or just drop this patch until that rework happens? The doc as is is obviously stale, why drop this patch? We should see a fix for the THP collapse issue very soon I guess. Most probably this patch will go upstream after that fix. > >> >>> >>> I can understand that we don't need IPI for thp split, but isn't the IPIs >>> still needed for thp collapse (aka pmdp_collapse_flush)? >> >> That was, unfortunately, never documented -- and as discussed in the >> other thread, arm64 doesn't do that IPI before collapse and might need >> fixing. We'll most probably end up getting rid of that >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by >> re-rechecking if the PMD changed. > > Yeah from an initial thought that looks valid to me. It'll also allow > pmdp_collapse_flush() to be dropped too, am I right? I think the magic about pmdp_collapse_flush() is not only the IPIs, but that we don't perform an ordinary PMD flush but we logically flush "all PTEs in that range". Apparently, that's a difference on some architectures.
On Thu, Sep 01, 2022 at 06:46:13PM +0200, David Hildenbrand wrote: > On 01.09.22 18:40, Peter Xu wrote: > > On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote: > >> On 01.09.22 18:28, Peter Xu wrote: > >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > >>>> PMDs") didn't remove all details about the THP split requirements for > >>>> RCU GUP-fast. > >>>> > >>>> IPI broeadcasts on THP split are no longer required. > >>>> > >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >>>> Cc: Sasha Levin <sasha.levin@oracle.com> > >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > >>>> Cc: Vlastimil Babka <vbabka@suse.cz> > >>>> Cc: Jerome Marchand <jmarchan@redhat.com> > >>>> Cc: Andrea Arcangeli <aarcange@redhat.com> > >>>> Cc: Hugh Dickins <hughd@google.com> > >>>> Cc: Jason Gunthorpe <jgg@nvidia.com> > >>>> Cc: John Hubbard <jhubbard@nvidia.com> > >>>> Cc: Peter Xu <peterx@redhat.com> > >>>> Cc: Yang Shi <shy828301@gmail.com> > >>>> Signed-off-by: David Hildenbrand <david@redhat.com> > >>>> --- > >>>> mm/gup.c | 5 ++--- > >>>> 1 file changed, 2 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/mm/gup.c b/mm/gup.c > >>>> index 5abdaf487460..cfe71f422787 100644 > >>>> --- a/mm/gup.c > >>>> +++ b/mm/gup.c > >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); > >>>> * > >>>> * Another way to achieve this is to batch up page table containing pages > >>>> * belonging to more than one mm_user, then rcu_sched a callback to free those > >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block > >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs > >>>> - * (which is a relatively rare event). The code below adopts this strategy. > >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the > >>>> + * rcu_sched callback. > >>> > >>> This is the comment for fast-gup in general but not only for thp split. > >> > >> "an IPI that we broadcast for splitting THP" is about splitting THP. > > > > Ah OK. Shall we still keep some "IPI broadcast" information here if we're > > modifying it? Otherwise it gives a feeling that none needs the IPIs. > > I guess that's the end goal -- and we forgot about the PMD collapse case. > > Are we aware of any other case that needs an IPI? I'd rather avoid > documenting something that's no longer true. I'm not aware of any. > > > > > It can be dropped later if you want to rework the thp collapse side and > > finally remove IPI dependency on fast-gup, but so far it seems to me it's > > still needed. Or just drop this patch until that rework happens? > > The doc as is is obviously stale, why drop this patch? > > We should see a fix for the THP collapse issue very soon I guess. Most > probably this patch will go upstream after that fix. No objection to have this patch alone as the removal statement is only about "thp split". But IMHO this patch alone didn't really help a great lot, especially if you plan to have more to come that is very relevant to this, so it'll be clearer to put them together. Your call. > > > > >> > >>> > >>> I can understand that we don't need IPI for thp split, but isn't the IPIs > >>> still needed for thp collapse (aka pmdp_collapse_flush)? > >> > >> That was, unfortunately, never documented -- and as discussed in the > >> other thread, arm64 doesn't do that IPI before collapse and might need > >> fixing. We'll most probably end up getting rid of that > >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by > >> re-rechecking if the PMD changed. > > > > Yeah from an initial thought that looks valid to me. It'll also allow > > pmdp_collapse_flush() to be dropped too, am I right? > > I think the magic about pmdp_collapse_flush() is not only the IPIs, but > that we don't perform an ordinary PMD flush but we logically flush "all > PTEs in that range". Yes there's a difference, good to learn that, thanks.
>>> It can be dropped later if you want to rework the thp collapse side and >>> finally remove IPI dependency on fast-gup, but so far it seems to me it's >>> still needed. Or just drop this patch until that rework happens? >> >> The doc as is is obviously stale, why drop this patch? >> >> We should see a fix for the THP collapse issue very soon I guess. Most >> probably this patch will go upstream after that fix. > > No objection to have this patch alone as the removal statement is only > about "thp split". But IMHO this patch alone didn't really help a great > lot, especially if you plan to have more to come that is very relevant to > this, so it'll be clearer to put them together. Your call. I can hold off the resend until the the fix is in place. Then I can add to the description that we are not aware of remaining IPI dependencies, and one undocumented case was broken and got fixed without the need for IPIs. Thanks!
On Thu, Sep 1, 2022 at 9:46 AM David Hildenbrand <david@redhat.com> wrote: > > On 01.09.22 18:40, Peter Xu wrote: > > On Thu, Sep 01, 2022 at 06:34:41PM +0200, David Hildenbrand wrote: > >> On 01.09.22 18:28, Peter Xu wrote: > >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: > >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting > >>>> PMDs") didn't remove all details about the THP split requirements for > >>>> RCU GUP-fast. > >>>> > >>>> IPI broeadcasts on THP split are no longer required. > >>>> > >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >>>> Cc: Sasha Levin <sasha.levin@oracle.com> > >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > >>>> Cc: Vlastimil Babka <vbabka@suse.cz> > >>>> Cc: Jerome Marchand <jmarchan@redhat.com> > >>>> Cc: Andrea Arcangeli <aarcange@redhat.com> > >>>> Cc: Hugh Dickins <hughd@google.com> > >>>> Cc: Jason Gunthorpe <jgg@nvidia.com> > >>>> Cc: John Hubbard <jhubbard@nvidia.com> > >>>> Cc: Peter Xu <peterx@redhat.com> > >>>> Cc: Yang Shi <shy828301@gmail.com> > >>>> Signed-off-by: David Hildenbrand <david@redhat.com> > >>>> --- > >>>> mm/gup.c | 5 ++--- > >>>> 1 file changed, 2 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/mm/gup.c b/mm/gup.c > >>>> index 5abdaf487460..cfe71f422787 100644 > >>>> --- a/mm/gup.c > >>>> +++ b/mm/gup.c > >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); > >>>> * > >>>> * Another way to achieve this is to batch up page table containing pages > >>>> * belonging to more than one mm_user, then rcu_sched a callback to free those > >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block > >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs > >>>> - * (which is a relatively rare event). The code below adopts this strategy. > >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the > >>>> + * rcu_sched callback. > >>> > >>> This is the comment for fast-gup in general but not only for thp split. > >> > >> "an IPI that we broadcast for splitting THP" is about splitting THP. > > > > Ah OK. Shall we still keep some "IPI broadcast" information here if we're > > modifying it? Otherwise it gives a feeling that none needs the IPIs. > > I guess that's the end goal -- and we forgot about the PMD collapse case. > > Are we aware of any other case that needs an IPI? I'd rather avoid > documenting something that's no longer true. > > > > > It can be dropped later if you want to rework the thp collapse side and > > finally remove IPI dependency on fast-gup, but so far it seems to me it's > > still needed. Or just drop this patch until that rework happens? > > The doc as is is obviously stale, why drop this patch? > > We should see a fix for the THP collapse issue very soon I guess. Most > probably this patch will go upstream after that fix. I will be working on the fix. > > > > >> > >>> > >>> I can understand that we don't need IPI for thp split, but isn't the IPIs > >>> still needed for thp collapse (aka pmdp_collapse_flush)? > >> > >> That was, unfortunately, never documented -- and as discussed in the > >> other thread, arm64 doesn't do that IPI before collapse and might need > >> fixing. We'll most probably end up getting rid of that > >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by > >> re-rechecking if the PMD changed. > > > > Yeah from an initial thought that looks valid to me. It'll also allow > > pmdp_collapse_flush() to be dropped too, am I right? > > I think the magic about pmdp_collapse_flush() is not only the IPIs, but > that we don't perform an ordinary PMD flush but we logically flush "all > PTEs in that range". Yeah, because THP collapse does copy the data before clearing pte. If we want to remove pmdp_collapse_flush() by just clearing pmd, we should clear *AND* flush pte before copying the data IIRC. > > Apparently, that's a difference on some architectures. > > > -- > Thanks, > > David / dhildenb >
On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: > Yeah, because THP collapse does copy the data before clearing pte. If > we want to remove pmdp_collapse_flush() by just clearing pmd, we > should clear *AND* flush pte before copying the data IIRC. Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will still be working (with the pte level flushing there) but it should just start to work for all archs, so potentially we could drop the arch-specific pmdp_collapse_flush()s, mostly the ppc impl. This also reminded me that the s390 version of pmdp_collapse_flush() is a bit weird, since it doesn't even have the tlb flush there. I feel like it's broken but I can't really tell whether something I've overlooked. Worth an eye on.
On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: > > On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: > > Yeah, because THP collapse does copy the data before clearing pte. If > > we want to remove pmdp_collapse_flush() by just clearing pmd, we > > should clear *AND* flush pte before copying the data IIRC. > > Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will > still be working (with the pte level flushing there) but it should just > start to work for all archs, so potentially we could drop the arch-specific > pmdp_collapse_flush()s, mostly the ppc impl. I'm don't know why powperpc needs to have its specific pmdp_collapse_flush() in the first place, not only the mandatory IPI broadcast, but also the specific implementation of pmd tlb flush. But anyway the IPI broadcast could be removed at least IMO. > > This also reminded me that the s390 version of pmdp_collapse_flush() is a > bit weird, since it doesn't even have the tlb flush there. I feel like > it's broken but I can't really tell whether something I've overlooked. > Worth an eye on. I don't know why. But if s390 doesn't flush tlb in pmdp_collapse_flush(), then there may be data integrity problem since the page is still writable when copying the data because pte is cleared after data copying. Or s390 hardware does flush tlb automatically? > > -- > Peter Xu >
On 01.09.22 20:35, Yang Shi wrote: > On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: >> >> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: >>> Yeah, because THP collapse does copy the data before clearing pte. If >>> we want to remove pmdp_collapse_flush() by just clearing pmd, we >>> should clear *AND* flush pte before copying the data IIRC. >> >> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will >> still be working (with the pte level flushing there) but it should just >> start to work for all archs, so potentially we could drop the arch-specific >> pmdp_collapse_flush()s, mostly the ppc impl. > > I'm don't know why powperpc needs to have its specific > pmdp_collapse_flush() in the first place, not only the mandatory IPI > broadcast, but also the specific implementation of pmd tlb flush. But > anyway the IPI broadcast could be removed at least IMO. > pmdp_collapse_flush() is overwritten on book3s only. It either translates to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). radix__pmdp_collapse_flush() has a comment explaining the situation: + /* + * pmdp collapse_flush need to ensure that there are no parallel gup + * walk after this call. This is needed so that we can have stable + * page ref count when collapsing a page. We don't allow a collapse page + * if we have gup taken on the page. We can ensure that by sending IPI + * because gup walk happens with IRQ disabled. + */ The comment for hash__pmdp_collapse_flush() is a bit more involved: /* * Wait for all pending hash_page to finish. This is needed * in case of subpage collapse. When we collapse normal pages * to hugepage, we first clear the pmd, then invalidate all * the PTE entries. The assumption here is that any low level * page fault will see a none pmd and take the slow path that * will wait on mmap_lock. But we could very well be in a * hash_page with local ptep pointer value. Such a hash page * can result in adding new HPTE entries for normal subpages. * That means we could be modifying the page content as we * copy them to a huge page. So wait for parallel hash_page * to finish before invalidating HPTE entries. We can do this * by sending an IPI to all the cpus and executing a dummy * function there. */ I'm not sure if that implies that the IPI is needed for some other hash-magic. Maybe Aneesh can clarify. >> >> This also reminded me that the s390 version of pmdp_collapse_flush() is a >> bit weird, since it doesn't even have the tlb flush there. I feel like >> it's broken but I can't really tell whether something I've overlooked. >> Worth an eye on. > > I don't know why. But if s390 doesn't flush tlb in > pmdp_collapse_flush(), then there may be data integrity problem since > the page is still writable when copying the data because pte is > cleared after data copying. Or s390 hardware does flush tlb > automatically? s390x does a pmdp_huge_get_and_clear(). pmdp_huge_get_and_clear() does an pmdp_xchg_direct(). pmdp_xchg_direct() does an pmdp_flush_direct(). pmdp_flush_direct() issues an IDTE, which is a TLB flush. Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there: /* * This is hard to understand. ptep_get_and_clear and ptep_clear_flush * both clear the TLB for the unmapped pte. The reason is that * ptep_get_and_clear is used in common code (e.g. change_pte_range) * to modify an active pte. The sequence is * 1) ptep_get_and_clear * 2) set_pte_at * 3) flush_tlb_range * On s390 the tlb needs to get flushed with the modification of the pte * if the pte is active. The only way how this can be implemented is to * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range * is a nop. */
On Fri, Sep 02, 2022 at 08:32:42AM +0200, David Hildenbrand wrote: > Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there: > > > /* > * This is hard to understand. ptep_get_and_clear and ptep_clear_flush > * both clear the TLB for the unmapped pte. The reason is that > * ptep_get_and_clear is used in common code (e.g. change_pte_range) > * to modify an active pte. The sequence is > * 1) ptep_get_and_clear > * 2) set_pte_at > * 3) flush_tlb_range > * On s390 the tlb needs to get flushed with the modification of the pte > * if the pte is active. The only way how this can be implemented is to > * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range > * is a nop. > */ Ah, now I kind of see why s390 has its own world of pte operations. But then we really should be noted on reworking the generic tlb code because s390 is always special; people will think the generic tlb API is for tlb but no-op for s390, e.g. anyone optimizes generic tlb flushing it'll probably never apply to s390. Besides performance, hopefully there'll be no case where the tlb change will be functional then it may affect s390 too. But I don't see any since as long as tlb was flushed earlier than the API then it seems always safe. Just trickier.
On Thu, Sep 1, 2022 at 11:32 PM David Hildenbrand <david@redhat.com> wrote: > > On 01.09.22 20:35, Yang Shi wrote: > > On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: > >> > >> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: > >>> Yeah, because THP collapse does copy the data before clearing pte. If > >>> we want to remove pmdp_collapse_flush() by just clearing pmd, we > >>> should clear *AND* flush pte before copying the data IIRC. > >> > >> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will > >> still be working (with the pte level flushing there) but it should just > >> start to work for all archs, so potentially we could drop the arch-specific > >> pmdp_collapse_flush()s, mostly the ppc impl. > > > > I'm don't know why powperpc needs to have its specific > > pmdp_collapse_flush() in the first place, not only the mandatory IPI > > broadcast, but also the specific implementation of pmd tlb flush. But > > anyway the IPI broadcast could be removed at least IMO. > > > > pmdp_collapse_flush() is overwritten on book3s only. It either translates > to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). > > > radix__pmdp_collapse_flush() has a comment explaining the situation: > > > + /* > + * pmdp collapse_flush need to ensure that there are no parallel gup > + * walk after this call. This is needed so that we can have stable > + * page ref count when collapsing a page. We don't allow a collapse page > + * if we have gup taken on the page. We can ensure that by sending IPI > + * because gup walk happens with IRQ disabled. > + */ > > > The comment for hash__pmdp_collapse_flush() is a bit more involved: > > /* > * Wait for all pending hash_page to finish. This is needed > * in case of subpage collapse. When we collapse normal pages > * to hugepage, we first clear the pmd, then invalidate all > * the PTE entries. The assumption here is that any low level > * page fault will see a none pmd and take the slow path that > * will wait on mmap_lock. But we could very well be in a > * hash_page with local ptep pointer value. Such a hash page > * can result in adding new HPTE entries for normal subpages. > * That means we could be modifying the page content as we > * copy them to a huge page. So wait for parallel hash_page > * to finish before invalidating HPTE entries. We can do this > * by sending an IPI to all the cpus and executing a dummy > * function there. > */ > > I'm not sure if that implies that the IPI is needed for some other hash-magic. They do issue IPI broadcast to call a dummy function in order to serialize against fast GUP, please see serialize_against_pte_lookup(), and it does full memory barrier too. I think the IPI broadcast could be removed once my fix is merged and the common pmd clear and memory barrier could be consolidated, now it is duplicated in both radix and hash. > > Maybe Aneesh can clarify. > > >> > >> This also reminded me that the s390 version of pmdp_collapse_flush() is a > >> bit weird, since it doesn't even have the tlb flush there. I feel like > >> it's broken but I can't really tell whether something I've overlooked. > >> Worth an eye on. > > > > I don't know why. But if s390 doesn't flush tlb in > > pmdp_collapse_flush(), then there may be data integrity problem since > > the page is still writable when copying the data because pte is > > cleared after data copying. Or s390 hardware does flush tlb > > automatically? > > s390x does a pmdp_huge_get_and_clear(). > > pmdp_huge_get_and_clear() does an pmdp_xchg_direct(). > > pmdp_xchg_direct() does an pmdp_flush_direct(). > > pmdp_flush_direct() issues an IDTE, which is a TLB flush. Aha, thanks, I didn't look that deep... I stopped looking once I saw pmdp_huge_get_and_clear(), I thought it just does clear... > > > Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there: > > > /* > * This is hard to understand. ptep_get_and_clear and ptep_clear_flush > * both clear the TLB for the unmapped pte. The reason is that > * ptep_get_and_clear is used in common code (e.g. change_pte_range) > * to modify an active pte. The sequence is > * 1) ptep_get_and_clear > * 2) set_pte_at > * 3) flush_tlb_range > * On s390 the tlb needs to get flushed with the modification of the pte > * if the pte is active. The only way how this can be implemented is to > * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range > * is a nop. > */ > > -- > Thanks, > > David / dhildenb >
On 9/1/22 10:04 PM, David Hildenbrand wrote: > On 01.09.22 18:28, Peter Xu wrote: >> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: >>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting >>> PMDs") didn't remove all details about the THP split requirements for >>> RCU GUP-fast. >>> >>> IPI broeadcasts on THP split are no longer required. >>> >>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>> Cc: Sasha Levin <sasha.levin@oracle.com> >>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >>> Cc: Vlastimil Babka <vbabka@suse.cz> >>> Cc: Jerome Marchand <jmarchan@redhat.com> >>> Cc: Andrea Arcangeli <aarcange@redhat.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: Jason Gunthorpe <jgg@nvidia.com> >>> Cc: John Hubbard <jhubbard@nvidia.com> >>> Cc: Peter Xu <peterx@redhat.com> >>> Cc: Yang Shi <shy828301@gmail.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> --- >>> mm/gup.c | 5 ++--- >>> 1 file changed, 2 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/gup.c b/mm/gup.c >>> index 5abdaf487460..cfe71f422787 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); >>> * >>> * Another way to achieve this is to batch up page table containing pages >>> * belonging to more than one mm_user, then rcu_sched a callback to free those >>> - * pages. Disabling interrupts will allow the fast_gup walker to both block >>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs >>> - * (which is a relatively rare event). The code below adopts this strategy. >>> + * pages. Disabling interrupts will allow the fast_gup walker to block the >>> + * rcu_sched callback. >> >> This is the comment for fast-gup in general but not only for thp split. > > "an IPI that we broadcast for splitting THP" is about splitting THP. > >> >> I can understand that we don't need IPI for thp split, but isn't the IPIs >> still needed for thp collapse (aka pmdp_collapse_flush)? > > That was, unfortunately, never documented -- and as discussed in the > other thread, arm64 doesn't do that IPI before collapse and might need > fixing. We'll most probably end up getting rid of that > (undocumented/forgotten) IPI requirement and fix it in GUP-fast by > re-rechecking if the PMD changed. > Can you point to the other thread ? -aneesh
On 9/2/22 12:02 PM, David Hildenbrand wrote: > On 01.09.22 20:35, Yang Shi wrote: >> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: >>> >>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: >>>> Yeah, because THP collapse does copy the data before clearing pte. If >>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we >>>> should clear *AND* flush pte before copying the data IIRC. >>> >>> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will >>> still be working (with the pte level flushing there) but it should just >>> start to work for all archs, so potentially we could drop the arch-specific >>> pmdp_collapse_flush()s, mostly the ppc impl. >> >> I'm don't know why powperpc needs to have its specific >> pmdp_collapse_flush() in the first place, not only the mandatory IPI >> broadcast, but also the specific implementation of pmd tlb flush. But >> anyway the IPI broadcast could be removed at least IMO. >> > > pmdp_collapse_flush() is overwritten on book3s only. It either translates > to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). > > > radix__pmdp_collapse_flush() has a comment explaining the situation: > > > + /* > + * pmdp collapse_flush need to ensure that there are no parallel gup > + * walk after this call. This is needed so that we can have stable > + * page ref count when collapsing a page. We don't allow a collapse page > + * if we have gup taken on the page. We can ensure that by sending IPI > + * because gup walk happens with IRQ disabled. > + */ > > > The comment for hash__pmdp_collapse_flush() is a bit more involved: > > /* > * Wait for all pending hash_page to finish. This is needed > * in case of subpage collapse. When we collapse normal pages > * to hugepage, we first clear the pmd, then invalidate all > * the PTE entries. The assumption here is that any low level > * page fault will see a none pmd and take the slow path that > * will wait on mmap_lock. But we could very well be in a > * hash_page with local ptep pointer value. Such a hash page > * can result in adding new HPTE entries for normal subpages. > * That means we could be modifying the page content as we > * copy them to a huge page. So wait for parallel hash_page > * to finish before invalidating HPTE entries. We can do this > * by sending an IPI to all the cpus and executing a dummy > * function there. > */ > > I'm not sure if that implies that the IPI is needed for some other hash-magic. > > Maybe Aneesh can clarify. > We still need the IPI for the hash. Another reason for architecture to override that function is to help them use the right page size when flushing the TLB. -aneesh
On 04.09.22 18:49, Aneesh Kumar K V wrote: > On 9/1/22 10:04 PM, David Hildenbrand wrote: >> On 01.09.22 18:28, Peter Xu wrote: >>> On Thu, Sep 01, 2022 at 09:21:19AM +0200, David Hildenbrand wrote: >>>> commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting >>>> PMDs") didn't remove all details about the THP split requirements for >>>> RCU GUP-fast. >>>> >>>> IPI broeadcasts on THP split are no longer required. >>>> >>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>>> Cc: Sasha Levin <sasha.levin@oracle.com> >>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >>>> Cc: Vlastimil Babka <vbabka@suse.cz> >>>> Cc: Jerome Marchand <jmarchan@redhat.com> >>>> Cc: Andrea Arcangeli <aarcange@redhat.com> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: Jason Gunthorpe <jgg@nvidia.com> >>>> Cc: John Hubbard <jhubbard@nvidia.com> >>>> Cc: Peter Xu <peterx@redhat.com> >>>> Cc: Yang Shi <shy828301@gmail.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> --- >>>> mm/gup.c | 5 ++--- >>>> 1 file changed, 2 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/mm/gup.c b/mm/gup.c >>>> index 5abdaf487460..cfe71f422787 100644 >>>> --- a/mm/gup.c >>>> +++ b/mm/gup.c >>>> @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); >>>> * >>>> * Another way to achieve this is to batch up page table containing pages >>>> * belonging to more than one mm_user, then rcu_sched a callback to free those >>>> - * pages. Disabling interrupts will allow the fast_gup walker to both block >>>> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs >>>> - * (which is a relatively rare event). The code below adopts this strategy. >>>> + * pages. Disabling interrupts will allow the fast_gup walker to block the >>>> + * rcu_sched callback. >>> >>> This is the comment for fast-gup in general but not only for thp split. >> >> "an IPI that we broadcast for splitting THP" is about splitting THP. >> >>> >>> I can understand that we don't need IPI for thp split, but isn't the IPIs >>> still needed for thp collapse (aka pmdp_collapse_flush)? >> >> That was, unfortunately, never documented -- and as discussed in the >> other thread, arm64 doesn't do that IPI before collapse and might need >> fixing. We'll most probably end up getting rid of that >> (undocumented/forgotten) IPI requirement and fix it in GUP-fast by >> re-rechecking if the PMD changed. >> > > Can you point to the other thread ? Sure see https://lkml.kernel.org/r/CAHbLzkqeDAnCdt3q4E2RZw64QEzVaO_pseR3VaoHUhB+rZFcZQ@mail.gmail.com and a resulting patch from that discussion https://lkml.kernel.org/r/20220901222707.477402-1-shy828301@gmail.com
On 04.09.22 18:52, Aneesh Kumar K V wrote: > On 9/2/22 12:02 PM, David Hildenbrand wrote: >> On 01.09.22 20:35, Yang Shi wrote: >>> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: >>>> >>>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: >>>>> Yeah, because THP collapse does copy the data before clearing pte. If >>>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we >>>>> should clear *AND* flush pte before copying the data IIRC. >>>> >>>> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will >>>> still be working (with the pte level flushing there) but it should just >>>> start to work for all archs, so potentially we could drop the arch-specific >>>> pmdp_collapse_flush()s, mostly the ppc impl. >>> >>> I'm don't know why powperpc needs to have its specific >>> pmdp_collapse_flush() in the first place, not only the mandatory IPI >>> broadcast, but also the specific implementation of pmd tlb flush. But >>> anyway the IPI broadcast could be removed at least IMO. >>> >> >> pmdp_collapse_flush() is overwritten on book3s only. It either translates >> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). >> >> >> radix__pmdp_collapse_flush() has a comment explaining the situation: >> >> >> + /* >> + * pmdp collapse_flush need to ensure that there are no parallel gup >> + * walk after this call. This is needed so that we can have stable >> + * page ref count when collapsing a page. We don't allow a collapse page >> + * if we have gup taken on the page. We can ensure that by sending IPI >> + * because gup walk happens with IRQ disabled. >> + */ >> >> >> The comment for hash__pmdp_collapse_flush() is a bit more involved: >> >> /* >> * Wait for all pending hash_page to finish. This is needed >> * in case of subpage collapse. When we collapse normal pages >> * to hugepage, we first clear the pmd, then invalidate all >> * the PTE entries. The assumption here is that any low level >> * page fault will see a none pmd and take the slow path that >> * will wait on mmap_lock. But we could very well be in a >> * hash_page with local ptep pointer value. Such a hash page >> * can result in adding new HPTE entries for normal subpages. >> * That means we could be modifying the page content as we >> * copy them to a huge page. So wait for parallel hash_page >> * to finish before invalidating HPTE entries. We can do this >> * by sending an IPI to all the cpus and executing a dummy >> * function there. >> */ >> >> I'm not sure if that implies that the IPI is needed for some other hash-magic. >> >> Maybe Aneesh can clarify. >> > > We still need the IPI for the hash. Another reason for architecture to override that > function is to help them use the right page size when flushing the TLB. Thanks for clarifying. So the radix variant wouldn't need the IPI anymore, once GUP-fast is handled differently, correct?
On 9/5/22 2:08 PM, David Hildenbrand wrote: > On 04.09.22 18:52, Aneesh Kumar K V wrote: >> On 9/2/22 12:02 PM, David Hildenbrand wrote: >>> On 01.09.22 20:35, Yang Shi wrote: >>>> On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@redhat.com> wrote: >>>>> >>>>> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: >>>>>> Yeah, because THP collapse does copy the data before clearing pte. If >>>>>> we want to remove pmdp_collapse_flush() by just clearing pmd, we >>>>>> should clear *AND* flush pte before copying the data IIRC. >>>>> >>>>> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will >>>>> still be working (with the pte level flushing there) but it should just >>>>> start to work for all archs, so potentially we could drop the arch-specific >>>>> pmdp_collapse_flush()s, mostly the ppc impl. >>>> >>>> I'm don't know why powperpc needs to have its specific >>>> pmdp_collapse_flush() in the first place, not only the mandatory IPI >>>> broadcast, but also the specific implementation of pmd tlb flush. But >>>> anyway the IPI broadcast could be removed at least IMO. >>>> >>> >>> pmdp_collapse_flush() is overwritten on book3s only. It either translates >>> to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). >>> >>> >>> radix__pmdp_collapse_flush() has a comment explaining the situation: >>> >>> >>> + /* >>> + * pmdp collapse_flush need to ensure that there are no parallel gup >>> + * walk after this call. This is needed so that we can have stable >>> + * page ref count when collapsing a page. We don't allow a collapse page >>> + * if we have gup taken on the page. We can ensure that by sending IPI >>> + * because gup walk happens with IRQ disabled. >>> + */ >>> >>> >>> The comment for hash__pmdp_collapse_flush() is a bit more involved: >>> >>> /* >>> * Wait for all pending hash_page to finish. This is needed >>> * in case of subpage collapse. When we collapse normal pages >>> * to hugepage, we first clear the pmd, then invalidate all >>> * the PTE entries. The assumption here is that any low level >>> * page fault will see a none pmd and take the slow path that >>> * will wait on mmap_lock. But we could very well be in a >>> * hash_page with local ptep pointer value. Such a hash page >>> * can result in adding new HPTE entries for normal subpages. >>> * That means we could be modifying the page content as we >>> * copy them to a huge page. So wait for parallel hash_page >>> * to finish before invalidating HPTE entries. We can do this >>> * by sending an IPI to all the cpus and executing a dummy >>> * function there. >>> */ >>> >>> I'm not sure if that implies that the IPI is needed for some other hash-magic. >>> >>> Maybe Aneesh can clarify. >>> >> >> We still need the IPI for the hash. Another reason for architecture to override that >> function is to help them use the right page size when flushing the TLB. > > Thanks for clarifying. So the radix variant wouldn't need the IPI anymore, once GUP-fast is handled differently, correct? > yes. With this patch https://lkml.kernel.org/r/20220901222707.477402-1-shy828301@gmail.com we can remove the serialize_against_pte_lookup(vma->vm_mm); in radix__pmdp_collapse_flush() -aneesh
diff --git a/mm/gup.c b/mm/gup.c index 5abdaf487460..cfe71f422787 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2309,9 +2309,8 @@ EXPORT_SYMBOL(get_user_pages_unlocked); * * Another way to achieve this is to batch up page table containing pages * belonging to more than one mm_user, then rcu_sched a callback to free those - * pages. Disabling interrupts will allow the fast_gup walker to both block - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs - * (which is a relatively rare event). The code below adopts this strategy. + * pages. Disabling interrupts will allow the fast_gup walker to block the + * rcu_sched callback. * * Before activating this code, please be aware that the following assumptions * are currently made:
commit 4b471e8898c3 ("mm, thp: remove infrastructure for handling splitting PMDs") didn't remove all details about the THP split requirements for RCU GUP-fast. IPI broeadcasts on THP split are no longer required. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/gup.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)