Message ID | 20240611153216.2794513-1-abrestic@rivosinc.com (mailing list archive)
---|---
State | New
Series | [v2] mm/memory: Don't require head page for do_set_pmd()
On 11.06.24 17:32, Andrew Bresticker wrote:
> The requirement that the head page be passed to do_set_pmd() was added
> in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> finish_fault() and filemap_map_pages() paths if the page to be inserted
> is anything but the head page for an otherwise suitable vma and pmd-sized
> page.
>
> Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> ---
>  mm/memory.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f47a533014e..a1fce5ddacb3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>  		return ret;
>
> -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> +	if (folio_order(folio) != HPAGE_PMD_ORDER)
>  		return ret;
> +	page = &folio->page;
>
>  	/*
>  	 * Just backoff if any subpage of a THP is corrupted otherwise

Acked-by: David Hildenbrand <david@redhat.com>
On Tue, Jun 11, 2024 at 08:32:16AM -0700, Andrew Bresticker wrote:
> -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> +	if (folio_order(folio) != HPAGE_PMD_ORDER)
>  		return ret;
> +	page = &folio->page;

This works today, but in about six months time it's going to be a pain.

+	page = folio_page(folio, 0);

is the one which works today and in the future.

Of course, as I'm writing this now I'm thinking we could make this work
for folios which are larger than PMD size.  Which we currently prohibit
(at least in part because this function doesn't work with folios larger
than PMD size):

	pgoff_t idx = ((haddr - vma->vm_start) / PAGE_SIZE) + vma->vm_pgoff;
	page = folio_file_page(folio, idx);

(there might be a more succinct way to write that, and I might have
messed up the polarity on something).  Worth starting to fix these
places that don't work with folios larger than PMD size?

I think it'd be worth supporting, eg 4MB folios.  Current bandwidth of
a gen5 x8 link is 32GB/s, so we can read 8,000 4MB pages/second at a
latency of 125us.  My laptop only has a gen4 x2 SSD, so a 4MB folio
would take 1ms to read, which might be a little excessive for normal
use.  Of course, allocating an order-10 folio is hard, so it's not
going to happen terribly often anyway.
On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 11.06.24 17:32, Andrew Bresticker wrote:
> > The requirement that the head page be passed to do_set_pmd() was added
> > in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> > folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> > finish_fault() and filemap_map_pages() paths if the page to be inserted
> > is anything but the head page for an otherwise suitable vma and pmd-sized
> > page.
> > [...]
>
> Acked-by: David Hildenbrand <david@redhat.com>

You know what I'm going to ask ;) I'm assuming that the runtime effects
are "small performance optimization" and that "should we backport the
fix" is "no".
On Tue, 11 Jun 2024 19:03:29 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Jun 11, 2024 at 08:32:16AM -0700, Andrew Bresticker wrote:
> > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> >  		return ret;
> > +	page = &folio->page;
>
> This works today, but in about six months time it's going to be a pain.
>
> +	page = folio_page(folio, 0);
>
> is the one which works today and in the future.

I was wondering about that.

hp2:/usr/src/25> fgrep "&folio->page" mm/*.c | wc -l
84
hp2:/usr/src/25> fgrep "folio_page(" mm/*.c | wc -l
35

Should these all be converted?  What's the general rule here?
On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
> > On 11.06.24 17:32, Andrew Bresticker wrote:
> > > [...]
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
>
> You know what I'm going to ask ;) I'm assuming that the runtime effects
> are "small performance optimization" and that "should we backport the
> fix" is "no".

We're going to stop using PMDs to map large folios unless the fault is
within the first 4KiB of the PMD.  No idea how many workloads that
affects, but it only needs to be backported as far as v6.8, so we may
as well backport it.
On Tue, Jun 11, 2024 at 11:21:31AM -0700, Andrew Morton wrote:
> On Tue, 11 Jun 2024 19:03:29 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> > [...]
> > This works today, but in about six months time it's going to be a pain.
> >
> > +	page = folio_page(folio, 0);
> >
> > is the one which works today and in the future.
>
> I was wondering about that.
>
> hp2:/usr/src/25> fgrep "&folio->page" mm/*.c | wc -l
> 84
> hp2:/usr/src/25> fgrep "folio_page(" mm/*.c | wc -l
> 35
>
> Should these all be converted?  What's the general rule here?

The rule is ...

 - If we haven't thought about it, use &folio->page to indicate that
   somebody needs to think about it.
 - If the code needs to be modified to split folio and page apart, use
   &folio->page.
 - If the code is part of compat code which is going to have to be
   removed, use &folio->page (eg do_read_cache_page()).

To *think* about it, and use folio_page() or folio_file_page(), don't
just blindly pass 0 as the second argument.  Think about which page
within the folio is expected by the function you're working on.  Often
that is "the first one!" and so folio_page(folio, 0) is the right
answer.  But that should be justified.  It might be that the right
answer is "Oh, that function should take a folio".
On Tue, 11 Jun 2024 19:22:03 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > [...]
> > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > are "small performance optimization" and that "should we backport the
> > fix" is "no".
>
> We're going to stop using PMDs to map large folios unless the fault is
> within the first 4KiB of the PMD.  No idea how many workloads that
> affects, but it only needs to be backported as far as v6.8, so we may
> as well backport it.

OK, thanks, I pasted the above text and added the cc:stable.

I didn't move it into the hotfixes queue - it's a non-trivial
behavioral change and extra test time seems prudent(?).
On Tue, 11 Jun 2024, Andrew Morton wrote:
> On Tue, 11 Jun 2024 19:22:03 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > > On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
> > > > [...]
> > > > Acked-by: David Hildenbrand <david@redhat.com>

Acked-by: Hugh Dickins <hughd@google.com>

> > > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > > are "small performance optimization" and that "should we backport the
> > > fix" is "no".
> >
> > We're going to stop using PMDs to map large folios unless the fault is
> > within the first 4KiB of the PMD.  No idea how many workloads that
> > affects, but it only needs to be backported as far as v6.8, so we
> > may as well backport it.
>
> OK, thanks, I pasted the above text and added the cc:stable.

Yes please.  My interest in this being that yesterday I discovered
the large drop in ShmemPmdMapped between v6.7 and v6.8, bisected,
and was testing overnight with a patch very much like this one of
Andrew's.  I'd been hoping to send mine today, but now no need.

> I didn't move it into the hotfixes queue - it's a non-trivial
> behavioral change and extra test time seems prudent(?).

It is certainly worth some test soak time, and the bug might have
been masking other issues which may now emerge; but the fix is
just reverting to the old pre-v6.8 behaviour.

Thanks,
Hugh
On 11.06.24 23:18, Hugh Dickins wrote:
> On Tue, 11 Jun 2024, Andrew Morton wrote:
>> [...]
>>
>> I didn't move it into the hotfixes queue - it's a non-trivial
>> behavioral change and extra test time seems prudent(?).
>
> It is certainly worth some test soak time, and the bug might have
> been masking other issues which may now emerge; but the fix is
> just reverting to the old pre-v6.8 behaviour.

Right, I don't expect surprises, really.  I'm rather surprised that
nobody noticed and that the usual 0-day benchmarks don't trigger
that case.
On Tue, Jun 11, 2024 at 07:22:03PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > [...]
> > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > are "small performance optimization" and that "should we backport the
> > fix" is "no".
>
> We're going to stop using PMDs to map large folios unless the fault is
> within the first 4KiB of the PMD.  No idea how many workloads that
> affects, but it only needs to be backported as far as v6.8, so we
> may as well backport it.

Hi, I am reviving this thread after noticing this comment attached
to the fix.

If you intend to install PTE-level mappings for faults that happen
outside of the first 4KiB, I believe this will make THP support for
KVM ineffective.
On Mon, 19 Aug 2024, Vincent Donnefort wrote:
> On Tue, Jun 11, 2024 at 07:22:03PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > > [...]
> > > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > > are "small performance optimization" and that "should we backport the
> > > fix" is "no".
> >
> > We're going to stop using PMDs to map large folios unless the fault is
> > within the first 4KiB of the PMD.  No idea how many workloads that
> > affects, but it only needs to be backported as far as v6.8, so we
> > may as well backport it.
>
> Hi, I am reviving this thread after noticing this comment attached
> to the fix.
>
> If you intend to install PTE level mappings for faults that happen
> outside of the first 4KiB, I believe this will make THP support for
> KVM ineffective.

You can relax, it's okay: where Matthew wrote "We're going to stop...",
he was describing the runtime effects of the bug (now fixed) to Andrew,
not proposing to make a change to mess up THP support.

The fix was backported to v6.9.N, but was too late for v6.8.N EOL.

Hugh
diff --git a/mm/memory.c b/mm/memory.c
index 0f47a533014e..a1fce5ddacb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;
 
-	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
+	if (folio_order(folio) != HPAGE_PMD_ORDER)
 		return ret;
+	page = &folio->page;
 
 	/*
 	 * Just backoff if any subpage of a THP is corrupted otherwise
The requirement that the head page be passed to do_set_pmd() was added
in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
finish_fault() and filemap_map_pages() paths if the page to be inserted
is anything but the head page for an otherwise suitable vma and
pmd-sized page.

Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)