Message ID | 02ffa542-ce49-4755-9d2b-29841f9973e0@kernel.dk (mailing list archive)
---|---
State | New
Series | Hugepage program taking forever to exit
On Tue, Sep 10, 2024 at 12:21:42PM -0600, Jens Axboe wrote:
> Hi,
>
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
>
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
>
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Yeah, this looks wrong to me:

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	list_for_each_entry(folio, folio_list, lru) {
		int ret = hugetlb_vmemmap_split_folio(h, folio);

		/*
		 * Spliting the PMD requires allocating a page, thus lets fail
		 * early once we encounter the first OOM. No point in retrying
		 * as it can be dynamically done on remap with the memory
		 * we get back from the vmemmap deduplication.
		 */
		if (ret == -ENOMEM)
			break;
	}

	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	list_for_each_entry(folio, folio_list, lru) {
		int ret;

		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
						       VMEMMAP_REMAP_NO_TLB_FLUSH);

		/*
		 * Pages to be freed may have been accumulated. If we
		 * encounter an ENOMEM, free what we have and try again.
		 * This can occur in the case that both spliting fails
		 * halfway and head page allocation also failed. In this
		 * case __hugetlb_vmemmap_optimize_folio() would free memory
		 * allowing more vmemmap remaps to occur.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
							 VMEMMAP_REMAP_NO_TLB_FLUSH);
		}
	}

	flush_tlb_all();
	free_vmemmap_page_list(&vmemmap_pages);
}

If you don't have HVO enabled, then hugetlb_vmemmap_split_folio() does
nothing. And __hugetlb_vmemmap_optimize_folio() also does nothing, leaving
&vmemmap_pages empty and free_vmemmap_page_list() a nop. So what's left is:
it flushes the TLB twice and waits for RCU. What for, exactly?

The same is true for hugetlb_vmemmap_optimize_folio() and the corresponding
split function, which wait for RCU on every page being allocated and freed,
even if the vmemmap is left alone.

Surely all those RCU waits and TLB flushes should be guarded by whether HVO
is actually enabled, no?
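For illustration only, the following is a minimal sketch of the kind of guard
being asked for here, not the merged fix (the fix is referenced later in this
thread as commit c2a967f6ab0e "mm/hugetlb_vmemmap: don't synchronize_rcu()
without HVO"). It assumes the existing vmemmap_optimize_enabled switch and the
hugetlb_vmemmap_optimizable_size() helper in mm/hugetlb_vmemmap.c; the real
patch may gate things differently.

/*
 * Sketch, not the actual fix: bail out before any TLB flush or RCU grace
 * period when HVO cannot have touched this hstate's vmemmap at all.
 */
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	/*
	 * Assumed guard: vmemmap_optimize_enabled is the runtime HVO switch
	 * and hugetlb_vmemmap_optimizable_size() reports whether this hstate
	 * has any vmemmap worth freeing. If HVO is off, there is nothing to
	 * split, remap, flush or wait for.
	 */
	if (!READ_ONCE(vmemmap_optimize_enabled) ||
	    !hugetlb_vmemmap_optimizable_size(h))
		return;

	/*
	 * ... the existing split / flush_tlb_all() / synchronize_rcu() /
	 * remap sequence quoted above would then run only in the HVO case ...
	 */
}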
On Tue, Sep 10, 2024 at 12:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
>
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
>
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> and yes, it does look like the program is mostly idle for most of the
> time while returning these huge pages. It's also telling us exactly why
> we're just sitting idle - RCU grace period.
>
> The below quick change means the runtime of the program is pretty much
> just the time it takes to execute the parts of it, as you can see from
> the full output after the change:
>
> axboe@r7525 ~> time sudo ./reg-huge
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> ________________________________________________________
> Executed in   39.53 secs    fish           external
>    usr time    4.88 millis  238.00 micros    4.64 millis
>    sys time    0.00 millis    0.00 micros    0.00 millis
>
> where 38632+876 == 39.51s.
>
> Looks like this was introduced by:
>
> commit bd225530a4c717714722c3731442b78954c765b3
> Author: Yu Zhao <yuzhao@google.com>
> Date:   Thu Jun 27 16:27:05 2024 -0600

Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable

c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

Additional improvements from mm-stable that may or may not help your test case:

e98337d11bbd mm/contig_alloc: support __GFP_COMP
463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
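Jens's reg-huge source isn't included in the thread. For anyone wanting to
reproduce the timings against these commits, a rough userspace sketch along
the lines he describes (map 1GB hugetlb pages, fault them in, register them
with io_uring via liburing) might look like the following. The page count,
output format and liburing calls are assumptions rather than his actual
program, and it needs enough 1GB pages reserved plus a memlock limit high
enough for the registration (presumably why his run uses sudo).

/*
 * Hypothetical reproducer in the spirit of the reported test -- not Jens's
 * actual reg-huge program. Needs 1GB hugetlb pages reserved up front, e.g.:
 *   echo 500 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 * Build with: gcc -O2 -o reg-huge-repro repro.c -luring
 */
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/uio.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)
#endif

#define NR_PAGES	500
#define PAGE_1G		(1UL << 30)

static unsigned long msecs_since(const struct timeval *s)
{
	struct timeval e;

	gettimeofday(&e, NULL);
	return (e.tv_sec - s->tv_sec) * 1000 + (e.tv_usec - s->tv_usec) / 1000;
}

int main(void)
{
	static struct iovec iovs[NR_PAGES];
	struct io_uring ring;
	struct timeval tv;
	int i;

	/* map the 1GB hugetlb pages (reservation only, nothing faulted yet) */
	gettimeofday(&tv, NULL);
	for (i = 0; i < NR_PAGES; i++) {
		void *p = mmap(NULL, PAGE_1G, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_HUGE_1GB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		iovs[i].iov_base = p;
		iovs[i].iov_len = PAGE_1G;
	}
	printf("Got %d huge pages (each 1024MB) in %lu msec\n", NR_PAGES,
	       msecs_since(&tv));

	/* touch every page so the hugetlb faults happen now, not later */
	gettimeofday(&tv, NULL);
	for (i = 0; i < NR_PAGES; i++)
		memset(iovs[i].iov_base, 0, PAGE_1G);
	printf("Faulted in %d huge pages in %lu msec\n", NR_PAGES,
	       msecs_since(&tv));

	/* register (pin) all of that memory as fixed buffers with io_uring */
	gettimeofday(&tv, NULL);
	if (io_uring_queue_init(8, &ring, 0) < 0 ||
	    io_uring_register_buffers(&ring, iovs, NR_PAGES) < 0) {
		fprintf(stderr, "io_uring setup/register failed\n");
		return 1;
	}
	printf("Registered %d pages in %lu msec\n", NR_PAGES, msecs_since(&tv));

	/* the interesting part is how long process exit takes from here */
	return 0;
}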
On 9/10/24 2:17 PM, Yu Zhao wrote:
> On Tue, Sep 10, 2024 at 12:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> Investigating another issue, I wrote the following simple program that allocates
>> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
>> step is timed:
>>
>> Got 500 huge pages (each 1024MB) in 0 msec
>> Faulted in 500 huge pages in 38632 msec
>> Registered 500 pages in 867 msec
>>
>> and as expected, faulting in the pages takes (by far) the longest. From
>> the above, you'd also expect the total runtime to be around ~39 seconds.
>> But it is not... In fact it takes 82 seconds in total for this program
>> to have exited. Looking at why, I see:
>>
>> [<0>] __wait_rcu_gp+0x12b/0x160
>> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
>> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
>> [<0>] update_and_free_pages_bulk+0x4c/0x220
>> [<0>] return_unused_surplus_pages+0x80/0xa0
>> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
>> [<0>] hugetlb_vm_op_close+0x160/0x180
>> [<0>] remove_vma+0x20/0x60
>> [<0>] exit_mmap+0x199/0x340
>> [<0>] mmput+0x49/0x110
>> [<0>] do_exit+0x261/0x9b0
>> [<0>] do_group_exit+0x2c/0x80
>> [<0>] __x64_sys_exit_group+0x14/0x20
>> [<0>] x64_sys_call+0x714/0x720
>> [<0>] do_syscall_64+0x5b/0x160
>> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>
>> and yes, it does look like the program is mostly idle for most of the
>> time while returning these huge pages. It's also telling us exactly why
>> we're just sitting idle - RCU grace period.
>>
>> The below quick change means the runtime of the program is pretty much
>> just the time it takes to execute the parts of it, as you can see from
>> the full output after the change:
>>
>> axboe@r7525 ~> time sudo ./reg-huge
>> Got 500 huge pages (each 1024MB) in 0 msec
>> Faulted in 500 huge pages in 38632 msec
>> Registered 500 pages in 867 msec
>>
>> ________________________________________________________
>> Executed in   39.53 secs    fish           external
>>    usr time    4.88 millis  238.00 micros    4.64 millis
>>    sys time    0.00 millis    0.00 micros    0.00 millis
>>
>> where 38632+876 == 39.51s.
>>
>> Looks like this was introduced by:
>>
>> commit bd225530a4c717714722c3731442b78954c765b3
>> Author: Yu Zhao <yuzhao@google.com>
>> Date:   Thu Jun 27 16:27:05 2024 -0600
>
> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
>
> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
>
> Additional improvements from mm-stable that may or may not help your test case:
>
> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios

I didn't pick patches, just pulled all of mm-stable - yep that fixes it
for me, no longer spending 60+% of the runtime during exit. I trust
these are heading to stable?
On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> > Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> >
> > c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >
> > Additional improvements from mm-stable that may or may not help your test case:
> >
> > e98337d11bbd mm/contig_alloc: support __GFP_COMP
> > 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> > cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
>
> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> for me, no longer spending 60+% of the runtime during exit. I trust
> these are heading to stable?

None of these are tagged for -stable backporting. We can later request
a backport without the cc:stable tag, although narrowing down the
required patches would be good.
On 9/10/24 9:42 PM, Andrew Morton wrote:
> On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>
>>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
>>>
>>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
>>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
>>>
>>> Additional improvements from mm-stable that may or may not help your test case:
>>>
>>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
>>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
>>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
>>
>> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
>> for me, no longer spending 60+% of the runtime during exit. I trust
>> these are heading to stable?
>
> None of these are tagged for -stable backporting. We can later request
> a backport without the cc:stable tag, although narrowing down the
> required patches would be good.

These two:

c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

to cover both cases.
On Wed, Sep 11, 2024 at 7:22 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 9/10/24 9:42 PM, Andrew Morton wrote:
> > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> >>>
> >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >>>
> >>> Additional improvements from mm-stable that may or may not help your test case:
> >>>
> >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> >>
> >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> >> for me, no longer spending 60+% of the runtime during exit. I trust
> >> these are heading to stable?
> >
> > None of these are tagged for -stable backporting. We can later request
> > a backport without the cc:stable tag, although narrowing down the
> > required patches would be good.
>
> These two:
>
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

This has the stable tag.

> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO

This one doesn't -- I was hoping it could make it into 6.10-rc but that
didn't happen. Is it still possible to add the stable tag? If not, I'll
follow up on sending the backport.
On Wed, 11 Sep 2024 07:22:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> On 9/10/24 9:42 PM, Andrew Morton wrote:
> > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> >>>
> >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >>>
> >>> Additional improvements from mm-stable that may or may not help your test case:
> >>>
> >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> >>
> >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> >> for me, no longer spending 60+% of the runtime during exit. I trust
> >> these are heading to stable?
> >
> > None of these are tagged for -stable backporting. We can later request
> > a backport without the cc:stable tag, although narrowing down the
> > required patches would be good.
>
> These two:
>
> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
>
> to cover both cases.

OK thanks.

Yu, could you please request a -stable backport after this hits mainline?

Thanks.
On Wed, Sep 11, 2024 at 12:39 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 11 Sep 2024 07:22:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>
> > On 9/10/24 9:42 PM, Andrew Morton wrote:
> > > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> > >>>
> > >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> > >>>
> > >>> Additional improvements from mm-stable that may or may not help your test case:
> > >>>
> > >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> > >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> > >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> > >>
> > >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> > >> for me, no longer spending 60+% of the runtime during exit. I trust
> > >> these are heading to stable?
> > >
> > > None of these are tagged for -stable backporting. We can later request
> > > a backport without the cc:stable tag, although narrowing down the
> > > required patches would be good.
> >
> > These two:
> >
> > c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >
> > to cover both cases.
>
> OK thanks.
>
> Yu, could you please request a -stable backport after this hits mainline?

Will do, thanks.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0c3f56b3578e..95f6ad8f8232 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -517,7 +517,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	long ret = 0;
 
 	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
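For completeness: the quick change above only expedites the grace period in
the restore path that shows up in the exit trace. The optimize path quoted
earlier in the thread has a matching synchronize_rcu() call, and the same
one-line workaround would apply there too; the snippet below is a sketch of
that idea, not a posted patch. The mm-stable fixes referenced above take the
better route of skipping the grace period entirely when HVO is not in use,
rather than merely making it faster.

	/* in hugetlb_vmemmap_optimize_folios(), same idea as the hunk above */
	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu_expedited();	/* was: synchronize_rcu() */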