Message ID | 20250305181611.54484-1-sj@kernel.org (mailing list archive)
---|---
Series | mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a
> batch is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.

Do real applications actually do madvise requests that span multiple
VMAs?  It just seems weird to me.  Like, each vma comes from a separate
call to mmap [1], so why would it make sense for an application to
call madvise() across a VMA boundary?

[1] Yes, I know we sometimes merge and/or split VMAs
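[For concreteness, one way a single madvise() call legitimately comes to
span a VMA boundary: a region created by one mmap() call gets split by a
later mprotect(), while the application still treats it as one buffer.
A minimal userspace sketch, illustrative only, with most error handling
elided:]

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	/* One mmap() call: initially a single VMA. */
	char *buf = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0xaa, 4 * pg);

	/* Changing protections mid-range splits the region into three VMAs. */
	mprotect(buf + pg, pg, PROT_READ);

	/* The application still treats [buf, buf + 4*pg) as one buffer, so
	 * this single call now spans multiple VMAs. */
	return madvise(buf, 4 * pg, MADV_DONTNEED) ? 1 : 0;
}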
On 05.03.25 19:56, Matthew Wilcox wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>> can happen for each vma of the given address ranges. [...]
>
> Do real applications actually do madvise requests that span multiple
> VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> call to mmap [1], so why would it make sense for an application to
> call madvise() across a VMA boundary?

I had the same question. If this happens in an app, I would assume that a
single MADV_DONTNEED call would usually not span multiple VMAs, and if it
does, not that many (and that often) that we would really care about it.

OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
would make more sense to me. I don't recall if process_madvise() allows for
that already, and if it does, is this series primarily tackling optimizing
that?
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > > can happen for each vma of the given address ranges. [...]
> >
> > Do real applications actually do madvise requests that span multiple
> > VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.
>
> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?

Yeah it's weird, but people can get caught out by unexpected failures to
merge if they do fun stuff with mremap().

Then again mremap() itself _mandates_ that you only span a single VMA (or
part of one) :)

Can we talk about the _true_ horror show that - you can span multiple VMAs
_with gaps_ and it'll allow you, only it'll return -ENOMEM at the end?

In madvise_walk_vmas():

	for (;;) {
		...

		if (start < vma->vm_start) {
			unmapped_error = -ENOMEM;
			start = vma->vm_start;
			...
		}

		...

		error = visit(vma, &prev, start, tmp, arg);
		if (error)
			return error;

		...
	}

	return unmapped_error;

So, you have no idea if that -ENOMEM is due to a gap, or due to the
operation returning an -ENOMEM?

I mean can we just drop this? Does anybody in their right mind rely on
this? Or is it intentional to somehow deal with a racing unmap?

But, no, we hold an mmap lock so that's not it.

Yeah OK so can we drop this madness? :) or am I missing some very important
detail about why we allow this?

I guess spanning multiple VMAs we _have_ to leave in because plausibly
there are users of that out there?
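[For reference, the gap behavior Lorenzo describes is observable from
userspace: the advice is applied to the mapped parts and the call still
returns ENOMEM because of the hole, so the error code alone cannot
distinguish a gap from a genuine failure. A small demonstration sketch:]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Punch a hole: now two VMAs with an unmapped gap between them. */
	munmap(buf + pg, pg);

	/* The advice is applied to the mapped pages on both sides of the
	 * gap, yet the call still fails with ENOMEM because of the hole. */
	if (madvise(buf, 3 * pg, MADV_DONTNEED))
		printf("madvise: %s (advice still applied to mapped parts)\n",
		       strerror(errno));
	return 0;
}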
On 05.03.25 20:26, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> On 05.03.25 19:56, Matthew Wilcox wrote:
>>> Do real applications actually do madvise requests that span multiple
>>> VMAs? [...]
>>
>> I had the same question. [...]
>
> Yeah it's weird, but people can get caught out by unexpected failures to
> merge if they do fun stuff with mremap().
>
> Then again mremap() itself _mandates_ that you only span a single VMA (or
> part of one) :)

Maybe some garbage collection use cases that shuffle individual pages, and
later free larger chunks using MADV_DONTNEED. Doesn't sound unlikely.

> Can we talk about the _true_ horror show that - you can span multiple VMAs
> _with gaps_ and it'll allow you, only it'll return -ENOMEM at the end?
>
> In madvise_walk_vmas():
>
> [...]
>
> So, you have no idea if that -ENOMEM is due to a gap, or due to the
> operation returning an -ENOMEM?
>
> I mean can we just drop this? Does anybody in their right mind rely on
> this? Or is it intentional to somehow deal with a racing unmap?
>
> But, no, we hold an mmap lock so that's not it.

Races could still happen if user space would do it from separate threads.
But then, who would prevent user space from doing another mmap() and
getting pages zapped ... so that sounds unlikely.

> Yeah OK so can we drop this madness? :) or am I missing some very important
> detail about why we allow this?

I stumbled over that myself a while ago. It's well-documented behavior in
the man page :(

At that point I stopped caring, because apparently somebody else cared
enough to document that clearly in the man page :)

> I guess spanning multiple VMAs we _have_ to leave in because plausibly
> there are users of that out there?

Spanning multiple VMAs can probably easily happen. At least in QEMU I know
some sane ways to trigger it on guest memory. But, all corner cases, so
nothing relevant for performance.
On Wed, Mar 05, 2025 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 05.03.25 20:26, Lorenzo Stoakes wrote:
> > [...]
> >
> > I mean can we just drop this? Does anybody in their right mind rely on
> > this? Or is it intentional to somehow deal with a racing unmap?
> >
> > But, no, we hold an mmap lock so that's not it.
>
> Races could still happen if user space would do it from separate threads.
> But then, who would prevent user space from doing another mmap() and
> getting pages zapped ... so that sounds unlikely.

Ah yeah, I mean if you got unlucky on timing, munmap()/mmap() or mremap()
with MAP_FIXED quickly unmaps on another thread before you grab the mmap
lock and fun times.

I mean for that to happen you'd have to be doing something... very odd or
insane so. But still.

> > Yeah OK so can we drop this madness? :) or am I missing some very
> > important detail about why we allow this?
>
> I stumbled over that myself a while ago. It's well-documented behavior in
> the man page :(

Haha ok I guess we have to live with it then.

> At that point I stopped caring, because apparently somebody else cared
> enough to document that clearly in the man page :)
>
> > I guess spanning multiple VMAs we _have_ to leave in because plausibly
> > there are users of that out there?
>
> Spanning multiple VMAs can probably easily happen. At least in QEMU I know
> some sane ways to trigger it on guest memory. But, all corner cases, so
> nothing relevant for performance.
On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> On 05.03.25 19:56, Matthew Wilcox wrote:
> > Do real applications actually do madvise requests that span multiple
> > VMAs?  It just seems weird to me.  Like, each vma comes from a separate
> > call to mmap [1], so why would it make sense for an application to
> > call madvise() across a VMA boundary?
>
> I had the same question. If this happens in an app, I would assume that a
> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
> does, not that many (and that often) that we would really care about it.

IMHO madvise() is just an add-on and the real motivation behind this
series is your next point.

> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> would make more sense to me. I don't recall if process_madvise() allows for
> that already, and if it does, is this series primarily tackling optimizing
> that?

Yes, process_madvise() allows that and that is what SJ has benchmarked
and reported in the cover letter. In addition, we are adding
process_madvise() support in jemalloc which will land soon.
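[For readers unfamiliar with the vectored interface: process_madvise()
takes an iovec of address ranges and applies one advice to all of them in
a single call. A hedged sketch of self-targeted batched MADV_DONTNEED;
note that MADV_DONTNEED is only accepted by process_madvise() on recent
kernels, and glibc provides no wrapper, hence syscall(2):]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	size_t pg = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 16 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec iov[8];

	if (buf == MAP_FAILED)
		return 1;

	/* Free every other page of the buffer in one vectored call. */
	for (int i = 0; i < 8; i++) {
		iov[i].iov_base = buf + 2 * i * pg;
		iov[i].iov_len = pg;
	}

	/* On newer kernels, PIDFD_SELF could replace the pidfd_open(). */
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
	ssize_t ret = syscall(SYS_process_madvise, pidfd, iov, 8,
			      MADV_DONTNEED, 0);

	printf("process_madvise() advised %zd bytes\n", ret);
	return ret < 0;
}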
On 05.03.25 20:46, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> [...]
>>
>> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
>> would make more sense to me. I don't recall if process_madvise() allows for
>> that already, and if it does, is this series primarily tackling optimizing
>> that?
>
> Yes, process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.

Makes a lot of sense to me!
On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
> > [...]
> >
> > OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
> > would make more sense to me. I don't recall if process_madvise() allows for
> > that already, and if it does, is this series primarily tackling optimizing
> > that?
>
> Yes, process_madvise() allows that and that is what SJ has benchmarked
> and reported in the cover letter. In addition, we are adding
> process_madvise() support in jemalloc which will land soon.

Feels like me adjusting that to allow for batched usage for guard regions
has opened up unexpected avenues, which is really cool to see :)

I presume the usage is intended for PIDFD_SELF usage right?

At some point we need to look at allowing a larger iovec size. This was
something I was planning to look at at some point, but my workload is
really overwhelming + that's low priority for me, so happy for you guys to
handle that if you want.

Can discuss at LSF if you guys will be there also :)
On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 11:46:31AM -0800, Shakeel Butt wrote:
> > [...]
> >
> > Yes, process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.
>
> Feels like me adjusting that to allow for batched usage for guard regions
> has opened up unexpected avenues, which is really cool to see :)
>
> I presume the usage is intended for PIDFD_SELF usage right?

Yes.

> At some point we need to look at allowing a larger iovec size. This was
> something I was planning to look at at some point, but my workload is
> really overwhelming + that's low priority for me, so happy for you guys to
> handle that if you want.
>
> Can discuss at LSF if you guys will be there also :)

Yup, we will be there and will be happy to discuss. Also the draft of
jemalloc using process_madvise() is at [1].

[1] https://github.com/jemalloc/jemalloc/pull/2794
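[On the iovec size limit Lorenzo mentions: the kernel's import_iovec()
rejects vlen above UIO_MAXIOV (1024, exposed to userspace as IOV_MAX), so
until that is lifted, larger batches have to be chunked. A hedged helper
sketch; the function name is illustrative and partial success (short
returns) is not handled:]

#include <limits.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Apply one advice to an arbitrarily long vector of ranges by splitting
 * it into IOV_MAX-sized process_madvise() calls. */
static int process_madvise_all(int pidfd, const struct iovec *iov,
			       size_t vlen, int advice)
{
	while (vlen) {
		size_t n = vlen > IOV_MAX ? IOV_MAX : vlen;

		if (syscall(SYS_process_madvise, pidfd, iov, n, advice, 0) < 0)
			return -1;
		iov += n;
		vlen -= n;
	}
	return 0;
}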
On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. Because such tlb
> flushes are for address ranges of the same process, doing those in a
> batch is more efficient while still being safe. Modify madvise() and
> process_madvise() entry level code path to do such batched tlb flushes,
> while the internal unmap logics do only gathering of the tlb entries to
> flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logics. Also modify the internal
> logics to do only gathering of the tlb entries to flush into the
> received mmu_gather object. After all internal function calls are done,
> the entry functions finish the mmu_gather object to flush the gathered
> tlb entries in the one batch.
>
> Patches Sequence
> ================
>
> First four patches are minor cleanups of madvise.c for readability.
>
> Following four patches (patches 5-8) define a new data structure for
> managing information that is required for batched tlb flushing
> (mmu_gather and behavior), and update code paths for
> MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal logics to
> receive it.
>
> Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> and MADV_FREE handling logic ready for batched tlb flushing follow.

I think you forgot to complete the above sentence, or the 'follow' at the
end seems weird.

> The
> patches keep the support of unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> Next three patches (patches 12-14) update madvise() and
> process_madvise() code to do the batched tlb flushes utilizing the
> changes the previous patches introduced.
>
> Final two patches (patches 15-16) clean up the internal logics'
> unbatched tlb flushes use case support code, which is no longer used.
>
> Test Results
> ============
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls. I apply the advice in 4 KiB
> sized regions granularity, but with varying batch size (vlen) from 1 to
> 1024. The source code for the measurement is available at GitHub[1].
>
> The measurement results are as below. 'sz_batches' column shows the
> batch size of process_madvise() calls. 'before' and 'after' columns are
> the measured time to apply MADV_DONTNEED to the 256 MiB memory buffer in
> nanoseconds, on kernels built without and with the MADV_DONTNEED
> tlb flushes batching patch of this series, respectively. For the
> baseline, mm-unstable tree of 2025-03-04[2] has been used.
> 'after/before' column is the ratio of 'after' to 'before'. So
> 'after/before' value lower than 1.0 means this patch increased
> efficiency over the baseline. And lower value means better efficiency.

I would recommend replacing the after/before column with a percentage,
i.e. percentage improvement or degradation.

>     sz_batches  before     after      after/before
>     1           102842895  106507398  1.03563204828102
>     2           73364942   74529223   1.01586971880929
>     4           58823633   51608504   0.877343022998937
>     8           47532390   44820223   0.942940655834895
>     16          43591587   36727177   0.842529018271347
>     32          44207282   33946975   0.767904595446515
>     64          41832437   26738286   0.639175910310939
>     128         40278193   23262940   0.577556694263817
>     256         41568533   22355103   0.537789077136785
>     512         41626638   22822516   0.54826709762148
>     1024        44440870   22676017   0.510251419470411
>
> For <=2 batch size, tlb flushes batching shows no big difference but
> slight overhead. I think that's in an error range of this simple
> micro-benchmark, and therefore can be ignored.

I would recommend running the experiment multiple times and reporting
averages and standard deviation, which will support your error-range
claim.

> Starting from batch size
> 4, however, tlb flushes batching shows clear efficiency gain. The
> efficiency gain tends to be proportional to the batch size, as expected.
> The efficiency gain ranges from about 13 percent with batch size 4, and
> up to 49 percent with batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so real
> efficiency gain on real workload could be very different.

I think you are running a single-threaded benchmark on a free machine. I
expect this series to be much more beneficial on a loaded machine and for
multi-threaded applications. No need to test that scenario but if you
have already done that then it would be good to report.
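[To make the cover letter's description concrete, the overall shape of
entry-level batching looks roughly like the sketch below.
tlb_gather_mmu()/tlb_finish_mmu() are the kernel's real mmu_gather API;
the struct layout and the madvise_do_behavior() name are stand-ins for
whatever the series actually introduces, not its real diff:]

/* Illustrative sketch only. */
struct madvise_behavior {
	int behavior;
	struct mmu_gather *tlb;	/* shared across all vmas of the range */
};

static int do_madvise_batched(struct mm_struct *mm, unsigned long start,
			      size_t len, int behavior)
{
	struct mmu_gather tlb;
	struct madvise_behavior madv_behavior = {
		.behavior = behavior,
		.tlb = &tlb,
	};
	int err;

	tlb_gather_mmu(&tlb, mm);
	/* The internal logic only gathers entries into madv_behavior.tlb... */
	err = madvise_do_behavior(mm, start, len, &madv_behavior);
	/* ...and the gathered entries are flushed here, in one batch. */
	tlb_finish_mmu(&tlb);
	return err;
}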
> On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
>
> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> can happen for each vma of the given address ranges. [...]

I made some related (similar?) patches in the past. You can see if you
find something useful in the discussion there. I think your version avoids
some of the “mistakes” I made.

[1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806
On Wed, 5 Mar 2025 20:49:13 +0100 David Hildenbrand <david@redhat.com> wrote:
> On 05.03.25 20:46, Shakeel Butt wrote:
> > [...]
> >
> > Yes, process_madvise() allows that and that is what SJ has benchmarked
> > and reported in the cover letter. In addition, we are adding
> > process_madvise() support in jemalloc which will land soon.

Shakeel is correct. Thank you for making the early clarification, Shakeel.
Also sorry for causing confusion. I will make this point clearer in the
next spin.

> Makes a lot of sense to me!

Seems Shakeel already addressed all questions so far, but please feel free
to raise more questions for anything not yet cleared up!

Thanks,
SJ
On Wed, 5 Mar 2025 11:57:34 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 07:49:50PM +0000, Lorenzo Stoakes wrote:
> > [...]
> >
> > Feels like me adjusting that to allow for batched usage for guard regions
> > has opened up unexpected avenues, which is really cool to see :)
> >
> > I presume the usage is intended for PIDFD_SELF usage right?
>
> Yes.

Indeed. Thank you for opening up the door, Lorenzo :)

> > At some point we need to look at allowing a larger iovec size. This was
> > something I was planning to look at at some point, but my workload is
> > really overwhelming + that's low priority for me, so happy for you guys
> > to handle that if you want.
> >
> > Can discuss at LSF if you guys will be there also :)
>
> Yup, we will be there and will be happy to discuss.

Me, too. Looking forward to the day!

Thanks,
SJ

[...]
On Wed, 5 Mar 2025 12:22:25 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
> > [...]
> >
> > Three patches (patches 9-11) for making internal MADV_DONTNEED[_LOCKED]
> > and MADV_FREE handling logic ready for batched tlb flushing follow.
>
> I think you forgot to complete the above sentence, or the 'follow' at the
> end seems weird.

Thank you for catching this. I just wanted to say these three patches come
after the previous ones. I will wordsmith this part in the next version.

> > [...]
> >
> > 'after/before' column is the ratio of 'after' to 'before'. So
> > 'after/before' value lower than 1.0 means this patch increased
> > efficiency over the baseline. And lower value means better efficiency.
>
> I would recommend replacing the after/before column with a percentage,
> i.e. percentage improvement or degradation.

Thank you for the nice suggestion. I will do so in the next version.

> >     sz_batches  before     after      after/before
> >     1           102842895  106507398  1.03563204828102
> >     [...]
> >     1024        44440870   22676017   0.510251419470411
> >
> > For <=2 batch size, tlb flushes batching shows no big difference but
> > slight overhead. I think that's in an error range of this simple
> > micro-benchmark, and therefore can be ignored.
>
> I would recommend running the experiment multiple times and reporting
> averages and standard deviation, which will support your error-range
> claim.

Again, good suggestion. I will do so.

> > Starting from batch size
> > 4, however, tlb flushes batching shows clear efficiency gain. [...]
> >
> > Please note that this is a very simple microbenchmark, so real
> > efficiency gain on real workload could be very different.
>
> I think you are running a single-threaded benchmark on a free machine. I
> expect this series to be much more beneficial on a loaded machine and for
> multi-threaded applications.

Your understanding of my test setup is correct, and I agree with your
expectation.

> No need to test that scenario but if you
> have already done that then it would be good to report.

I don't have such test results, or plans for those on a specific timeline,
for now. I will share them if I get a chance, of course.

Thanks,
SJ
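[On the repeated-runs suggestion: the conventional report is mean plus or
minus the sample standard deviation per batch size, along with the
percentage improvement (before - after) / before * 100, which for the
batch-size-1024 row above gives roughly 49%, matching the cover letter's
claim. An illustrative helper, names hypothetical:]

#include <math.h>
#include <stdio.h>

/* Print mean +- sample stddev of 'before'/'after' timings (ns) across
 * multiple runs, and the mean percentage improvement. */
static void report(const double *before, const double *after, int runs)
{
	double mb = 0, ma = 0, vb = 0, va = 0;

	for (int i = 0; i < runs; i++) {
		mb += before[i] / runs;
		ma += after[i] / runs;
	}
	for (int i = 0; i < runs; i++) {
		vb += (before[i] - mb) * (before[i] - mb) / (runs - 1);
		va += (after[i] - ma) * (after[i] - ma) / (runs - 1);
	}
	printf("before %.0f +- %.0f ns, after %.0f +- %.0f ns, improvement %.1f%%\n",
	       mb, sqrt(vb), ma, sqrt(va), (mb - ma) / mb * 100.0);
}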
On Wed, 5 Mar 2025 22:36:45 +0200 Nadav Amit <nadav.amit@gmail.com> wrote:
> > On 5 Mar 2025, at 20:15, SeongJae Park <sj@kernel.org> wrote:
> >
> > For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
> > can happen for each vma of the given address ranges. [...]
>
> I made some related (similar?) patches in the past. You can see if you
> find something useful in the discussion there. I think your version avoids
> some of the “mistakes” I made.

Thank you for sharing this! I'm sure I can learn from and use many things
in this work and the previous discussions.

> [1] https://lore.kernel.org/all/20210926161259.238054-1-namit@vmware.com/T/#m23ccd29bad04a963c4d8c64ec3581f7c301c7806

Thanks,
SJ