Message ID | 20250305181611.54484-1-sj@kernel.org (mailing list archive)
---|---
Series | mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a
> batch is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.

Do real applications actually do madvise requests that span multiple
VMAs?  It just seems weird to me.  Like, each vma comes from a separate
call to mmap [1], so why would it make sense for an application to
call madvise() across a VMA boundary?

[1] Yes, I know we sometimes merge and/or split VMAs
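[For concreteness, one way a single madvise() call legitimately comes to
span a VMA boundary: a region created by one mmap() call gets split by a
later mprotect(), while the application still treats it as one buffer.
A minimal userspace sketch, illustrative only, with most error handling
elided:]

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	/* One mmap() call: initially a single VMA. */
	char *buf = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0xaa, 4 * pg);

	/* Changing protections mid-range splits the region into three VMAs. */
	mprotect(buf + pg, pg, PROT_READ);

	/* The application still treats [buf, buf + 4*pg) as one buffer, so
	 * this single call now spans multiple VMAs. */
	return madvise(buf, 4 * pg, MADV_DONTNEED) ? 1 : 0;
}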
On 05.03.25 19:56, Matthew Wilcox wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>> can happen for each vma of the given address ranges. [...]
>
> Do real applications actually do madvise requests that span multiple
> VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> call to mmap [1], so why would it make sense for an application to
> call madvise() across a VMA boundary?

I had the same question. If this happens in an app, I would assume that a
single MADV_DONTNEED call would usually not span multiple VMAs, and if it
does, not that many (and that often) that we would really care about it.

OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
would make more sense to me. I don't recall if process_madvise() allows for
that already, and if it does, is this series primarily tackling optimizing
that?
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > can happen for each vma of the given address ranges. [...]
> >
> > Do real applications actually do madvise requests that span multiple
> > VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.
>
> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?

Yeah it's weird, but people can get caught out by unexpected failures to
merge if they do fun stuff with mremap().

Then again mremap() itself _mandates_ that you only span a single VMA (or
part of one) :)

Can we talk about the _true_ horror show that - you can span multiple VMAs
_with gaps_ and it'll allow you, only it'll return -ENOMEM at the end?

In madvise_walk_vmas():

	for (;;) {
		...

		if (start < vma->vm_start) {
			unmapped_error = -ENOMEM;
			start = vma->vm_start;
			...
		}

		...

		error = visit(vma, &prev, start, tmp, arg);
		if (error)
			return error;

		...
	}

	return unmapped_error;

So, you have no idea if that -ENOMEM is due to a gap, or due to the
operation returning an -ENOMEM?

I mean can we just drop this? Does anybody in their right mind rely on
this? Or is it intentional to somehow deal with a racing unmap?

But, no, we hold an mmap lock so that's not it.

Yeah OK so can we drop this madness? :) or am I missing some very important
detail about why we allow this?

I guess spanning multiple VMAs we _have_ to leave in because plausibly
there are users of that out there?
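[For reference, the gap behavior Lorenzo describes is observable from
userspace: the advice is applied to the mapped parts and the call still
returns ENOMEM because of the hole, so the error code alone cannot
distinguish a gap from a genuine failure. A small demonstration sketch:]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Punch a hole: now two VMAs with an unmapped gap between them. */
	munmap(buf + pg, pg);

	/* The advice is applied to the mapped pages on both sides of the
	 * gap, yet the call still fails with ENOMEM because of the hole. */
	if (madvise(buf, 3 * pg, MADV_DONTNEED))
		printf("madvise: %s (advice still applied to mapped parts)\n",
		       strerror(errno));
	return 0;
}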
On 05.03.25 20:26, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> On 05.03.25 19:56, Matthew Wilcox wrote:
>>> Do real applications actually do madvise requests that span multiple
>>> VMAs? [...]
>>
>> I had the same question. [...]
>
> Yeah it's weird, but people can get caught out by unexpected failures to
> merge if they do fun stuff with mremap().
>
> Then again mremap() itself _mandates_ that you only span a single VMA (or
> part of one) :)

Maybe some garbage collection use cases that shuffle individual pages, and
later free larger chunks using MADV_DONTNEED. Doesn't sound unlikely.

> Can we talk about the _true_ horror show that - you can span multiple VMAs
> _with gaps_ and it'll allow you, only it'll return -ENOMEM at the end?
>
> In madvise_walk_vmas():
>
> [...]
>
> So, you have no idea if that -ENOMEM is due to a gap, or due to the
> operation returning an -ENOMEM?
>
> I mean can we just drop this? Does anybody in their right mind rely on
> this? Or is it intentional to somehow deal with a racing unmap?
>
> But, no, we hold an mmap lock so that's not it.

Races could still happen if user space would do it from separate threads.
But then, who would prevent user space from doing another mmap() and
getting pages zapped ... so that sounds unlikely.

> Yeah OK so can we drop this madness? :) or am I missing some very important
> detail about why we allow this?

I stumbled over that myself a while ago. It's well-documented behavior in
the man page :(

At that point I stopped caring, because apparently somebody else cared
enough to document that clearly in the man page :)

> I guess spanning multiple VMAs we _have_ to leave in because plausibly
> there are users of that out there?

Spanning multiple VMAs can probably easily happen. At least in QEMU I know
some sane ways to trigger it on guest memory. But, all corner cases, so
nothing relevant for performance.
On Wed, Mar 05, 2025 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 05.03.25 20:26, Lorenzo Stoakes wrote:
> > [...]
> >
> > I mean can we just drop this? Does anybody in their right mind rely on
> > this? Or is it intentional to somehow deal with a racing unmap?
> >
> > But, no, we hold an mmap lock so that's not it.
>
> Races could still happen if user space would do it from separate threads.
> But then, who would prevent user space from doing another mmap() and
> getting pages zapped ... so that sounds unlikely.

Ah yeah, I mean if you got unlucky on timing, munmap()/mmap() or mremap()
with MAP_FIXED quickly unmaps on another thread before you grab the mmap
lock and fun times.

I mean for that to happen you'd have to be doing something... very odd or
insane so. But still.

> > Yeah OK so can we drop this madness? :) or am I missing some very
> > important detail about why we allow this?
>
> I stumbled over that myself a while ago. It's well-documented behavior in
> the man page :(

Haha ok I guess we have to live with it then.

> At that point I stopped caring, because apparently somebody else cared
> enough to document that clearly in the man page :)
>
> > I guess spanning multiple VMAs we _have_ to leave in because plausibly
> > there are users of that out there?
>
> Spanning multiple VMAs can probably easily happen. At least in QEMU I know
> some sane ways to trigger it on guest memory. But, all corner cases, so
> nothing relevant for performance.
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > Do real applications actually do madvise requests that span multiple
> > VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.

IMHO madvise() is just an add-on and the real motivation behind this
series is your next point.

> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?

Yes, process_madvise() allows that and that is what SJ has benchmarked
and reported in the cover letter. In addition, we are adding
process_madvise() support in jemalloc which will land soon.
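[For readers unfamiliar with the vectored interface: process_madvise()
takes an iovec of address ranges and applies one advice to all of them in
a single call. A hedged sketch of self-targeted batched MADV_DONTNEED;
note that MADV_DONTNEED is only accepted by process_madvise() on recent
kernels, and glibc provides no wrapper, hence syscall(2):]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 16 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec iov[8];

	if (buf == MAP_FAILED)
		return 1;

	/* Free every other page of the buffer in one vectored call. */
	for (int i = 0; i < 8; i++) {
		iov[i].iov_base = buf + 2 * i * pg;
		iov[i].iov_len = pg;
	}

	/* On newer kernels, PIDFD_SELF could replace the pidfd_open(). */
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	ssize_t ret = syscall(SYS_process_madvise, pidfd, iov, 8,
			      MADV_DONTNEED, 0);

	printf("process_madvise() advised %zd bytes\n", ret);
	return ret < 0;
}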
On 05.03.25 20:46, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
>> would make more sense to me. I don't recall if process_madvise() allows for
>> that already, and if it does, is this series primarily tackling optimizing
>> that?
>
> Yes, process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.

Makes a lot of sense to me!
On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > [...]
> >
> > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > would make more sense to me. I don't recall if process_madvise() allows for
> > that already, and if it does, is this series primarily tackling optimizing
> > that?
>
> Yes, process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.

Feels like me adjusting that to allow for batched usage for guard regions
has opened up unexpected avenues, which is really cool to see :)

I presume the usage is intended for PIDFD_SELF usage right?

At some point we need to look at allowing a larger iovec size. This was
something I was planning to look at at some point, but my workload is
really overwhelming + that's low priority for me, so happy for you guys to
handle that if you want.

Can discuss at LSF if you guys will be there also :)
On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> > [...]
> >
> > Yes, process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.
>
> Feels like me adjusting that to allow for batched usage for guard regions
> has opened up unexpected avenues, which is really cool to see :)
>
> I presume the usage is intended for PIDFD_SELF usage right?

Yes.

> At some point we need to look at allowing a larger iovec size. This was
> something I was planning to look at at some point, but my workload is
> really overwhelming + that's low priority for me, so happy for you guys to
> handle that if you want.
>
> Can discuss at LSF if you guys will be there also :)

Yup, we will be there and will be happy to discuss. Also the draft of
jemalloc using process_madvise() is at [1].

[1] https://github.com/jemalloc/jemalloc/pull/2794
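[On the iovec size limit Lorenzo mentions: the kernel's import_iovec()
rejects vlen above UIO_MAXIOV (1024, exposed to userspace as IOV_MAX), so
until that is lifted, larger batches have to be chunked. A hedged helper
sketch; the function name is illustrative and partial success (short
returns) is not handled:]

#include <limits.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Apply one advice to an arbitrarily long vector of ranges by splitting
 * it into IOV_MAX-sized process_madvise() calls. */
static int process_madvise_all(int pidfd, const struct iovec *iov,
			       size_t vlen, int advice)
{
	while (vlen) {
		size_t n = vlen > IOV_MAX ? IOV_MAX : vlen;

		if (syscall(SYS_process_madvise, pidfd, iov, n, advice, 0) < 0)
			return -1;
		iov += n;
		vlen -= n;
	}
	return 0;
}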
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a
> batch is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logics. Also modify the internal
> logics to do only gathering of the tlb entries to flush into the
> received mmu_gather object. After all internal function calls are done,
> the entry functions finish the mmu_gather object to flush the gathered
> tlb entries in the one batch.
>
> Patches Sequence
> ================
>
> First four patches are minor cleanups of madvise.c for readability.
>
> Following four patches (patches 5-8) define a new data structure for
> managing information that is required for batched tlb flushing
> (mmu_gather and behavior), and update code paths for
> MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logics to
> receive it.
>
> Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> and MADV_FREE handling logic ready for batched tlb flushing follow.

I think you forgot to complete the above sentence, or the 'follow' at the
end seems weird.

> The
> patches keep the support of unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> Next three patches (patches 12-14) update madvise() and
> process_madvise() code to do the batched tlb flushes utilizing the
> changes the previous patches introduced.
>
> Final two patches (patches 15-16) clean up the internal logics'
> unbatched tlb flushes use case support code, which is no longer used.
>
> Test Results
> ============
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls. I apply the advice in 4 KiB
> sized regions granularity, but with varying batch size (vlen) from 1 to
> 1024. The source code for the measurement is available at GitHub[1].
>
> The measurement results are as below. 'sz_batches' column shows the
> batch size of process_madvise() calls. 'before' and 'after' columns are
> the measured time to apply MADV_DONTNEED to the 256 MiB memory buffer in
> nanoseconds, on kernels built without and with the MADV_DONTNEED
> tlb flushes batching patch of this series, respectively. For the
> baseline, mm-unstable tree of 2025-03-04[2] has been used.
> 'after/before' column is the ratio of 'after' to 'before'. So
> 'after/before' value lower than 1.0 means this patch increased
> efficiency over the baseline. And lower value means better efficiency.

I would recommend replacing the after/before column with a percentage,
i.e. percentage improvement or degradation.

>     sz_batches  before     after      after/before
>     1           102842895  106507398  1.03563204828102
>     2           73364942   74529223   1.01586971880929
>     4           58823633   51608504   0.877343022998937
>     8           47532390   44820223   0.942940655834895
>     16          43591587   36727177   0.842529018271347
>     32          44207282   33946975   0.767904595446515
>     64          41832437   26738286   0.639175910310939
>     128         40278193   23262940   0.577556694263817
>     256         41568533   22355103   0.537789077136785
>     512         41626638   22822516   0.54826709762148
>     1024        44440870   22676017   0.510251419470411
>
> For <=2 batch size, tlb flushes batching shows no big difference but
> slight overhead. I think that's in an error range of this simple
> micro-benchmark, and therefore can be ignored.

I would recommend running the experiment multiple times and reporting
averages and standard deviation, which will support your error-range
claim.

> Starting from batch size
> 4, however, tlb flushes batching shows clear efficiency gain. The
> efficiency gain tends to be proportional to the batch size, as expected.
> The efficiency gain ranges from about 13 percent with batch size 4, and
> up to 49 percent with batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so real
> efficiency gain on real workload could be very different.

I think you are running a single-threaded benchmark on a free machine. I
expect this series to be much more beneficial on a loaded machine and for
multi-threaded applications. No need to test that scenario but if you
have already done that then it would be good to report.
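[To make the cover letter's description concrete, the overall shape of
entry-level batching looks roughly like the sketch below.
tlb_gather_mmu()/tlb_finish_mmu() are the kernel's real mmu_gather API;
the struct layout and the madvise_do_behavior() name are stand-ins for
whatever the series actually introduces, not its real diff:]

/* Illustrative sketch only. */
struct madvise_behavior {
	int behavior;
	struct mmu_gather *tlb;	/* shared across all vmas of the range */
};

static int do_madvise_batched(struct mm_struct *mm, unsigned long start,
			      size_t len, int behavior)
{
	struct mmu_gather tlb;
	struct madvise_behavior madv_behavior = {
		.behavior = behavior,
		.tlb = &tlb,
	};
	int err;

	tlb_gather_mmu(&tlb, mm);
	/* The internal logic only gathers entries into madv_behavior.tlb... */
	err = madvise_do_behavior(mm, start, len, &madv_behavior);
	/* ...and the gathered entries are flushed here, in one batch. */
	tlb_finish_mmu(&tlb);
	return err;
}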
> On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
>
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. [...]

I made some related (similar?) patches in the past. You can see if you
find something useful in the discussion there. I think your version avoids
some of the “mistakes” I made.

[1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806
On Wed, 5 Mar 2025 20:49:13 +0100 David Hildenbrand <david@redhat.com> wrote:
> On 05.03.25 20:46, Shakeel Butt wrote:
> > [...]
> >
> > Yes, process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.

Shakeel is correct. Thank you for making the early clarification, Shakeel.
Also sorry for causing confusion. I will make this point clearer in the
next spin.

> Makes a lot of sense to me!

Seems Shakeel already addressed all questions so far, but please feel free
to raise more questions for anything not yet cleared up!

Thanks,
SJ
On Wed, 5 Mar 2025 11:57:34 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> > [...]
> >
> > Feels like me adjusting that to allow for batched usage for guard regions
> > has opened up unexpected avenues, which is really cool to see :)
> >
> > I presume the usage is intended for PIDFD_SELF usage right?
>
> Yes.

Indeed. Thank you for opening up the door, Lorenzo :)

> > At some point we need to look at allowing a larger iovec size. This was
> > something I was planning to look at at some point, but my workload is
> > really overwhelming + that's low priority for me, so happy for you guys
> > to handle that if you want.
> >
> > Can discuss at LSF if you guys will be there also :)
>
> Yup, we will be there and will be happy to discuss.

Me, too. Looking forward to the day!

Thanks,
SJ

[...]
On Wed, 5 Mar 2025 12:22:25 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > [...]
> >
> > Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> > and MADV_FREE handling logic ready for batched tlb flushing follow.
>
> I think you forgot to complete the above sentence, or the 'follow' at the
> end seems weird.

Thank you for catching this. I just wanted to say these three patches come
after the previous ones. I will wordsmith this part in the next version.

> > [...]
> >
> > 'after/before' column is the ratio of 'after' to 'before'. So
> > 'after/before' value lower than 1.0 means this patch increased
> > efficiency over the baseline. And lower value means better efficiency.
>
> I would recommend replacing the after/before column with a percentage,
> i.e. percentage improvement or degradation.

Thank you for the nice suggestion. I will do so in the next version.

> >     sz_batches  before     after      after/before
> >     1           102842895  106507398  1.03563204828102
> >     [...]
> >     1024        44440870   22676017   0.510251419470411
> >
> > For <=2 batch size, tlb flushes batching shows no big difference but
> > slight overhead. I think that's in an error range of this simple
> > micro-benchmark, and therefore can be ignored.
>
> I would recommend running the experiment multiple times and reporting
> averages and standard deviation, which will support your error-range
> claim.

Again, good suggestion. I will do so.

> > Starting from batch size
> > 4, however, tlb flushes batching shows clear efficiency gain. [...]
> >
> > Please note that this is a very simple microbenchmark, so real
> > efficiency gain on real workload could be very different.
>
> I think you are running a single-threaded benchmark on a free machine. I
> expect this series to be much more beneficial on a loaded machine and for
> multi-threaded applications.

Your understanding of my test setup is correct, and I agree with your
expectation.

> No need to test that scenario but if you
> have already done that then it would be good to report.

I don't have such test results, or plans for those on a specific timeline,
for now. I will share them if I get a chance, of course.

Thanks,
SJ
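[On the repeated-runs suggestion: the conventional report is mean plus or
minus the sample standard deviation per batch size, along with the
percentage improvement (before - after) / before * 100, which for the
batch-size-1024 row above gives roughly 49%, matching the cover letter's
claim. An illustrative helper, names hypothetical:]

#include <math.h>
#include <stdio.h>

/* Print mean +- sample stddev of 'before'/'after' timings (ns) across
 * multiple runs, and the mean percentage improvement. */
static void report(const double *before, const double *after, int runs)
{
	double mb = 0, ma = 0, vb = 0, va = 0;

	for (int i = 0; i < runs; i++) {
		mb += before[i] / runs;
		ma += after[i] / runs;
	}
	for (int i = 0; i < runs; i++) {
		vb += (before[i] - mb) * (before[i] - mb) / (runs - 1);
		va += (after[i] - ma) * (after[i] - ma) / (runs - 1);
	}
	printf("before %.0f +- %.0f ns, after %.0f +- %.0f ns, improvement %.1f%%\n",
	       mb, sqrt(vb), ma, sqrt(va), (mb - ma) / mb * 100.0);
}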
On Wed, 5 Mar 2025 22:36:45 +0200 Nadav Amit <nadav.amit@gmail.com> wrote:
> > On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
> >
> > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > can happen for each vma of the given address ranges. [...]
>
> I made some related (similar?) patches in the past. You can see if you
> find something useful in the discussion there. I think your version avoids
> some of the “mistakes” I made.

Thank you for sharing this! I'm sure I can learn from and use many things
in this work and the previous discussions.

> [1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806

Thanks,
SJ