Hugepage program taking forever to exit

Message ID 02ffa542-ce49-4755-9d2b-29841f9973e0@kernel.dk (mailing list archive)
State New

Commit Message

Jens Axboe Sept. 10, 2024, 6:21 p.m. UTC
Hi,

Investigating another issue, I wrote the following simple program that allocates
and faults in 500 1GB huge pages, and then registers them with io_uring. Each
step is timed:

Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

and as expected, faulting in the pages takes (by far) the longest. From
the above, you'd also expect the total runtime to be around ~39 seconds.
But it is not... In fact it takes 82 seconds in total for this program
to have exited. Looking at why, I see:

[<0>] __wait_rcu_gp+0x12b/0x160
[<0>] synchronize_rcu_normal.part.0+0x2a/0x30
[<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
[<0>] update_and_free_pages_bulk+0x4c/0x220
[<0>] return_unused_surplus_pages+0x80/0xa0
[<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
[<0>] hugetlb_vm_op_close+0x160/0x180
[<0>] remove_vma+0x20/0x60
[<0>] exit_mmap+0x199/0x340
[<0>] mmput+0x49/0x110
[<0>] do_exit+0x261/0x9b0
[<0>] do_group_exit+0x2c/0x80
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] x64_sys_call+0x714/0x720
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

and yes, it does look like the program is mostly idle for most of the
time while returning these huge pages. It's also telling us exactly why
we're just sitting idle - RCU grace period.

The below quick change means the runtime of the program is pretty much
just the time it takes to execute the parts of it, as you can see from
the full output after the change:

axboe@r7525 ~> time sudo ./reg-huge
Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

________________________________________________________
Executed in   39.53 secs      fish           external
   usr time    4.88 millis  238.00 micros    4.64 millis
   sys time    0.00 millis    0.00 micros    0.00 millis

where 38632+876 == 39.51s.

Looks like this was introduced by:

commit bd225530a4c717714722c3731442b78954c765b3
Author: Yu Zhao <yuzhao@google.com>
Date:   Thu Jun 27 16:27:05 2024 -0600
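Back-of-envelope arithmetic on the numbers in this message (an estimate only — it assumes the extra exit time is spent almost entirely waiting on RCU grace periods, amortized evenly over the 500 pages):

```python
# Figures from the logs above; the per-page split is our assumption.
total_before = 82.0              # seconds until exit before the change
useful = 38.632 + 0.867          # fault-in + register time
idle = total_before - useful     # roughly 42.5 s mostly idle at exit
per_page_ms = idle / 500 * 1000  # roughly 85 ms of waiting per huge page
print(f"{idle:.1f} s idle, {per_page_ms:.0f} ms per page")
```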

Comments

Johannes Weiner Sept. 10, 2024, 7:33 p.m. UTC | #1
On Tue, Sep 10, 2024 at 12:21:42PM -0600, Jens Axboe wrote:
> Hi,
> 
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
> 
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
> 
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
> 
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Yeah, this looks wrong to me:

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	list_for_each_entry(folio, folio_list, lru) {
		int ret = hugetlb_vmemmap_split_folio(h, folio);

		/*
		 * Spliting the PMD requires allocating a page, thus lets fail
		 * early once we encounter the first OOM. No point in retrying
		 * as it can be dynamically done on remap with the memory
		 * we get back from the vmemmap deduplication.
		 */
		if (ret == -ENOMEM)
			break;
	}

	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	list_for_each_entry(folio, folio_list, lru) {
		int ret;

		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
						       VMEMMAP_REMAP_NO_TLB_FLUSH);

		/*
		 * Pages to be freed may have been accumulated.  If we
		 * encounter an ENOMEM,  free what we have and try again.
		 * This can occur in the case that both spliting fails
		 * halfway and head page allocation also failed. In this
		 * case __hugetlb_vmemmap_optimize_folio() would free memory
		 * allowing more vmemmap remaps to occur.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
							 VMEMMAP_REMAP_NO_TLB_FLUSH);
		}
	}

	flush_tlb_all();
	free_vmemmap_page_list(&vmemmap_pages);
}

If you don't have HVO enabled, then hugetlb_vmemmap_split_folio() does
nothing. And __hugetlb_vmemmap_optimize_folio() also does nothing,
leaving &vmemmap_pages empty and free_vmemmap_page_list() a nop.

So what's left is: it flushes the TLB twice and waits for RCU. What
for exactly?

The same is true for hugetlb_vmemmap_optimize_folio() and the
corresponding split function, which waits for RCU on every page being
allocated and freed, even if the vmemmap is left alone.

Surely all those RCU waits and tlb flushes should be guarded by
whether the HVO is actually enabled, no?
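The guard Johannes suggests — and, in spirit, what the later fix c2a967f6ab0e implements — amounts to an early return before any of the expensive barriers. The snippet below is an illustrative userspace model with stand-in names, not the kernel code:

```c
#include <stdbool.h>

/* Stand-in state and barriers; all names here are ours, not mm/'s. */
static bool vmemmap_optimize_enabled;	/* models the HVO boot/sysctl knob */
static int tlb_flushes, rcu_waits;	/* count the expensive operations */

static void model_flush_tlb_all(void) { tlb_flushes++; }
static void model_synchronize_rcu(void) { rcu_waits++; }

static void optimize_folios(int nr_folios)
{
	/* The guard: with HVO off there is no vmemmap remapping to
	 * serialize against, so skip both TLB flushes and the RCU
	 * grace-period wait entirely instead of paying for them per batch. */
	if (!vmemmap_optimize_enabled)
		return;

	/* ... split PMDs for each folio ... */
	model_flush_tlb_all();
	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	model_synchronize_rcu();
	/* ... remap the vmemmap for each folio ... */
	model_flush_tlb_all();
	(void)nr_folios;
}
```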
Yu Zhao Sept. 10, 2024, 8:17 p.m. UTC | #2
On Tue, Sep 10, 2024 at 12:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
>
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
>
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> and yes, it does look like the program is mostly idle for most of the
> time while returning these huge pages. It's also telling us exactly why
> we're just sitting idle - RCU grace period.
>
> The below quick change means the runtime of the program is pretty much
> just the time it takes to execute the parts of it, as you can see from
> the full output after the change:
>
> axboe@r7525 ~> time sudo ./reg-huge
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> ________________________________________________________
> Executed in   39.53 secs      fish           external
>    usr time    4.88 millis  238.00 micros    4.64 millis
>    sys time    0.00 millis    0.00 micros    0.00 millis
>
> where 38632+876 == 39.51s.
>
> Looks like this was introduced by:
>
> commit bd225530a4c717714722c3731442b78954c765b3
> Author: Yu Zhao <yuzhao@google.com>
> Date:   Thu Jun 27 16:27:05 2024 -0600

Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable

c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

Additional improvements from mm-stable that may or may not help your test case:

e98337d11bbd mm/contig_alloc: support __GFP_COMP
463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
Jens Axboe Sept. 10, 2024, 11:08 p.m. UTC | #3
On 9/10/24 2:17 PM, Yu Zhao wrote:
> On Tue, Sep 10, 2024 at 12:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> Investigating another issue, I wrote the following simple program that allocates
>> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
>> step is timed:
>>
>> Got 500 huge pages (each 1024MB) in 0 msec
>> Faulted in 500 huge pages in 38632 msec
>> Registered 500 pages in 867 msec
>>
>> and as expected, faulting in the pages takes (by far) the longest. From
>> the above, you'd also expect the total runtime to be around ~39 seconds.
>> But it is not... In fact it takes 82 seconds in total for this program
>> to have exited. Looking at why, I see:
>>
>> [<0>] __wait_rcu_gp+0x12b/0x160
>> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
>> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
>> [<0>] update_and_free_pages_bulk+0x4c/0x220
>> [<0>] return_unused_surplus_pages+0x80/0xa0
>> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
>> [<0>] hugetlb_vm_op_close+0x160/0x180
>> [<0>] remove_vma+0x20/0x60
>> [<0>] exit_mmap+0x199/0x340
>> [<0>] mmput+0x49/0x110
>> [<0>] do_exit+0x261/0x9b0
>> [<0>] do_group_exit+0x2c/0x80
>> [<0>] __x64_sys_exit_group+0x14/0x20
>> [<0>] x64_sys_call+0x714/0x720
>> [<0>] do_syscall_64+0x5b/0x160
>> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>
>> and yes, it does look like the program is mostly idle for most of the
>> time while returning these huge pages. It's also telling us exactly why
>> we're just sitting idle - RCU grace period.
>>
>> The below quick change means the runtime of the program is pretty much
>> just the time it takes to execute the parts of it, as you can see from
>> the full output after the change:
>>
>> axboe@r7525 ~> time sudo ./reg-huge
>> Got 500 huge pages (each 1024MB) in 0 msec
>> Faulted in 500 huge pages in 38632 msec
>> Registered 500 pages in 867 msec
>>
>> ________________________________________________________
>> Executed in   39.53 secs      fish           external
>>    usr time    4.88 millis  238.00 micros    4.64 millis
>>    sys time    0.00 millis    0.00 micros    0.00 millis
>>
>> where 38632+876 == 39.51s.
>>
>> Looks like this was introduced by:
>>
>> commit bd225530a4c717714722c3731442b78954c765b3
>> Author: Yu Zhao <yuzhao@google.com>
>> Date:   Thu Jun 27 16:27:05 2024 -0600
> 
> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> 
> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> 
> Additional improvements from mm-stable that may or may not help your test case:
> 
> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios

I didn't pick patches, just pulled all of mm-stable - yep that fixes it
for me, no longer spending 60+% of the runtime during exit. I trust
these are heading to stable?
Andrew Morton Sept. 11, 2024, 3:42 a.m. UTC | #4
On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> > Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> > 
> > c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> > 
> > Additional improvements from mm-stable that may or may not help your test case:
> > 
> > e98337d11bbd mm/contig_alloc: support __GFP_COMP
> > 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> > cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> 
> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> for me, no longer spending 60+% of the runtime during exit. I trust
> these are heading to stable?

None of these are tagged for -stable backporting.  We can later request
a backport without the cc:stable tag, although narrowing down the
required patches would be good.
Jens Axboe Sept. 11, 2024, 1:22 p.m. UTC | #5
On 9/10/24 9:42 PM, Andrew Morton wrote:
> On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
>>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
>>>
>>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
>>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
>>>
>>> Additional improvements from mm-stable that may or may not help your test case:
>>>
>>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
>>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
>>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
>>
>> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
>> for me, no longer spending 60+% of the runtime during exit. I trust
>> these are heading to stable?
> 
> None of these are tagged for -stable backporting.  We can later request
> a backport without the cc:stable tag, although narrowing down the
> required patches would be good.

These two:

c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

to cover both cases.
Yu Zhao Sept. 11, 2024, 4:23 p.m. UTC | #6
On Wed, Sep 11, 2024 at 7:22 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 9/10/24 9:42 PM, Andrew Morton wrote:
> > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> >>>
> >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >>>
> >>> Additional improvements from mm-stable that may or may not help your test case:
> >>>
> >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> >>
> >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> >> for me, no longer spending 60+% of the runtime during exit. I trust
> >> these are heading to stable?
> >
> > None of these are tagged for -stable backporting.  We can later request
> > a backport without the cc:stable tag, although narrowing down the
> > required patches would be good.
>
> These two:
>
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting

This has the stable tag.

> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO

This one doesn't -- I was hoping it could make into 6.10-rc but that
didn't happen. Is it still possible to add the stable tag? If not,
I'll follow up on sending the backport.
Andrew Morton Sept. 11, 2024, 6:38 p.m. UTC | #7
On Wed, 11 Sep 2024 07:22:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> On 9/10/24 9:42 PM, Andrew Morton wrote:
> > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> > 
> >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> >>>
> >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >>>
> >>> Additional improvements from mm-stable that may or may not help your test case:
> >>>
> >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> >>
> >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> >> for me, no longer spending 60+% of the runtime during exit. I trust
> >> these are heading to stable?
> > 
> > None of these are tagged for -stable backporting.  We can later request
> > a backport without the cc:stable tag, although narrowing down the
> > required patches would be good.
> 
> These two:
> 
> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> 
> to cover both cases.

OK thanks.

Yu, could you please request a -stable backport after this hits mainline?

Thanks.
Yu Zhao Sept. 11, 2024, 10:08 p.m. UTC | #8
On Wed, Sep 11, 2024 at 12:39 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 11 Sep 2024 07:22:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>
> > On 9/10/24 9:42 PM, Andrew Morton wrote:
> > > On Tue, 10 Sep 2024 17:08:30 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > >>> Fixes are in git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable
> > >>>
> > >>> c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > >>> c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> > >>>
> > >>> Additional improvements from mm-stable that may or may not help your test case:
> > >>>
> > >>> e98337d11bbd mm/contig_alloc: support __GFP_COMP
> > >>> 463586e9ff39 mm/cma: add cma_{alloc,free}_folio()
> > >>> cf54f310d0d3 mm/hugetlb: use __GFP_COMP for gigantic folios
> > >>
> > >> I didn't pick patches, just pulled all of mm-stable - yep that fixes it
> > >> for me, no longer spending 60+% of the runtime during exit. I trust
> > >> these are heading to stable?
> > >
> > > None of these are tagged for -stable backporting.  We can later request
> > > a backport without the cc:stable tag, although narrowing down the
> > > required patches would be good.
> >
> > These two:
> >
> > c2a967f6ab0e mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
> > c0f398c3b2cf mm/hugetlb_vmemmap: batch HVO work when demoting
> >
> > to cover both cases.
>
> OK thanks.
>
> Yu, could you please request a -stable backport after this hits mainline?

Will do, thanks.

Patch

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0c3f56b3578e..95f6ad8f8232 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -517,7 +517,7 @@  long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	long ret = 0;
 
 	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {