Message ID | 20250310172318.653630-1-sj@kernel.org (mailing list archive) |
---|---|
Series | mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE |
On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:

> It is unclear if such a use case
> is common and the inefficiency is significant.

Well, we could conduct a survey.

Can you add some logging to detect when userspace performs such an
madvise() call, then run that kernel on some "typical" machines which are
running "typical" workloads? That should give us a feeling for how often
userspace does this, and hence will help us understand the usefulness of
this patchset.
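For reference, a minimal sketch of what such one-off survey logging could
look like, assuming it is called from wherever madvise() walks the vmas of
the requested range; the helper name, hook point, and nr_vmas counting are
illustrative assumptions, not code from this series:

```c
#include <linux/printk.h>
#include <linux/sched.h>

/*
 * Illustrative, throwaway survey instrumentation only -- not part of this
 * series.  The idea is to note whenever a single madvise() range turns out
 * to cover more than one vma, so "typical" workloads can be surveyed.
 * Where this hooks in and how nr_vmas is counted are assumptions.
 */
static void madvise_survey_log(unsigned long start, unsigned long end,
			       int behavior, int nr_vmas)
{
	if (nr_vmas > 1)
		pr_info_ratelimited("madvise(%d): [%#lx, %#lx) spans %d vmas (comm: %s)\n",
				    behavior, start, end, nr_vmas,
				    current->comm);
}
```

The ratelimited print keeps the survey from flooding the logs on workloads
that do this frequently, which is itself a useful signal.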
On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> > It is unclear if such a use case
> > is common and the inefficiency is significant.
>
> Well, we could conduct a survey.
>
> Can you add some logging to detect when userspace performs such an
> madvise() call, then run that kernel on some "typical" machines which
> are running "typical" workloads? That should give us a feeling for how
> often userspace does this, and hence will help us understand the
> usefulness of this patchset.

Just for clarification, this patchset is very useful for process_madvise(),
and the experiment results show that. I think what SJ wants to say is that,
specifically for the madvise() syscall, this change might or might not be
that helpful. It will be helpful if the user application is madvising
regions comprising multiple vmas. However, this patchset is very useful for
process_madvise().
On Mon, 10 Mar 2025 15:39:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> > It is unclear if such a use case
> > is common and the inefficiency is significant.
>
> Well, we could conduct a survey.
>
> Can you add some logging to detect when userspace performs such an
> madvise() call, then run that kernel on some "typical" machines which
> are running "typical" workloads? That should give us a feeling for how
> often userspace does this,

I agree that could make this patch series more informative.

> and hence will help us understand the usefulness
> of this patchset.

Nevertheless, what this patchset is really trying to optimize is not the
madvise() use case, but process_madvise() use. I believe the usage is
apparently common, given the vectorization-based semantics of
process_madvise(). Jemalloc is also adding[1] that kind of use case. And
the benefit is clear, given the microbenchmark results that I shared.

Also, this patchset shouldn't introduce new regressions to madvise().

Hence, I think the survey could be interesting and helpful, but it
shouldn't be a blocker for this patch series. Could you please let me know
if I'm missing something and if you still want the survey?

[1] https://github.com/jemalloc/jemalloc/pull/2794


Thanks,
SJ
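For context, the vectorized use case this series optimizes looks roughly
like the userspace sketch below: many small regions advised with a single
process_madvise() call. It is only an illustration and assumes a kernel
recent enough to accept MADV_DONTNEED through process_madvise(); privilege
requirements may also vary by kernel version.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define NR_REGIONS	64
#define REGION_SZ	4096UL

int main(void)
{
	struct iovec iov[NR_REGIONS];
	char *buf;
	long ret;
	/* pidfd referring to the calling process itself */
	int pidfd = syscall(SYS_pidfd_open, getpid(), 0);

	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	buf = mmap(NULL, NR_REGIONS * REGION_SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, NR_REGIONS * REGION_SZ);	/* fault the pages in */

	/* Describe NR_REGIONS separate 4 KiB ranges for one syscall. */
	for (int i = 0; i < NR_REGIONS; i++) {
		iov[i].iov_base = buf + i * REGION_SZ;
		iov[i].iov_len = REGION_SZ;
	}

	/*
	 * One call covers all ranges; with this series, the kernel can also
	 * do a single batched tlb flush for them instead of one per range.
	 */
	ret = syscall(SYS_process_madvise, pidfd, iov, (size_t)NR_REGIONS,
		      MADV_DONTNEED, 0U);
	if (ret < 0)
		perror("process_madvise");
	else
		printf("advised %ld bytes\n", ret);
	return ret < 0;
}
```

Each additional entry in iov[] is another address range that, before this
series, would pay for its own tlb flush.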
On Mon, Mar 10, 2025 at 04:15:06PM -0700, Shakeel Butt wrote:
> On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> > On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> >
> > > It is unclear if such a use case
> > > is common and the inefficiency is significant.
> >
> > Well, we could conduct a survey.
> >
> > Can you add some logging to detect when userspace performs such an
> > madvise() call, then run that kernel on some "typical" machines which
> > are running "typical" workloads? That should give us a feeling for how
> > often userspace does this, and hence will help us understand the
> > usefulness of this patchset.
>
> Just for clarification, this patchset is very useful for
> process_madvise(), and the experiment results show that.

+1

Google carried an internal version of a vectorized madvise() which was much
faster than process_madvise() the last time I measured it. I hope SJ's
patchset will (partially) address this difference, which will hopefully
allow us to drop the internal implementation in favor of process_madvise().
On Mon, Mar 10, 2025 at 11:36:52PM +0000, Roman Gushchin wrote:
> On Mon, Mar 10, 2025 at 04:15:06PM -0700, Shakeel Butt wrote:
> > On Mon, Mar 10, 2025 at 03:39:21PM -0700, Andrew Morton wrote:
> > > On Mon, 10 Mar 2025 10:23:09 -0700 SeongJae Park <sj@kernel.org> wrote:
> > >
> > > > It is unclear if such a use case
> > > > is common and the inefficiency is significant.
> > >
> > > Well, we could conduct a survey.
> > >
> > > Can you add some logging to detect when userspace performs such an
> > > madvise() call, then run that kernel on some "typical" machines which
> > > are running "typical" workloads? That should give us a feeling for how
> > > often userspace does this, and hence will help us understand the
> > > usefulness of this patchset.
> >
> > Just for clarification, this patchset is very useful for
> > process_madvise(), and the experiment results show that.
>
> +1
>
> Google carried an internal version of a vectorized madvise() which was
> much faster than process_madvise() the last time I measured it. I hope
> SJ's patchset will (partially) address this difference, which will
> hopefully allow us to drop the internal implementation in favor of
> process_madvise().

Relatedly, I also feel that, at some point, we ought to remove the
UIO_FASTIOV limit on process_madvise(). But that's one for a future
series...
The below commit message talks about tlb bits, but you spend a lot of this
series refactoring mm/madvise.c.

Can we just separate out the two into separate series, please?

I am doing the same kind of thing with mremap() at the moment, but sent the
refactor _first_, before the work that builds upon it :)

Your refactoring is great, so I want to be able to take that (and Andrew's
objections obviously don't apply there), and then we can address the tlb
stuff separately and in a more focused way.

On Mon, Mar 10, 2025 at 10:23:09AM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges. Because such tlb flushes are for the same
> process, doing those in a batch is more efficient while still being
> safe. Modify the process_madvise() entry-level code path to do such
> batched tlb flushes, while the internal unmap logic only gathers the
> tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logic. Make the internal logic only
> gather the tlb entries to flush into the received mmu_gather object.
> After all internal function calls are done, the entry functions flush
> the gathered tlb entries at once.
>
> The inefficiency should be smaller for the madvise() use case, since it
> receives only a single address range. But if there are multiple vmas
> for the range, the same problem can happen. It is unclear if such a use
> case is common and the inefficiency is significant. Make the change for
> madvise(), too, since it doesn't really change madvise()'s internal
> behavior while it helps keep the code shared between process_madvise()
> and madvise() internals clean.
>
> Patch Sequence
> ==============
>
> The first four patches are minor cleanups of madvise.c for readability.
>
> The fifth patch defines a new data structure for managing the
> information required for batched tlb flushes (mmu_gather and behavior),
> and updates the internal MADV_DONTNEED[_LOCKED] and MADV_FREE handling
> code paths to receive it.
>
> The sixth and seventh patches make the internal logic for handling
> MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
> The patches keep support for the unbatched tlb flush use case, for
> fine-grained and safe transitions.
>
> The eighth patch updates the madvise() and process_madvise() code to do
> the batched tlb flushes, utilizing the changes introduced by the
> previous patches.
>
> The final, ninth patch removes the internal logic's support code for
> the unbatched tlb flush use case, which is no longer used.
>
> Test Results
> ============
>
> I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
> memory using multiple process_madvise() calls. I apply the advice at
> 4 KiB region granularity, with the batch size per process_madvise()
> call (vlen) varying from 1 to 1024. The source code for the measurement
> is available on GitHub[1]. To reduce measurement errors, I did the
> measurement five times.
>
> The measurement results are below. The 'sz_batch' column shows the
> batch size of the process_madvise() calls. The 'Before' and 'After'
> columns show the average latency in nanoseconds over the five
> measurements, on kernels built without and with the tlb flush batching
> patches of this series, respectively. For the baseline, the mm-unstable
> tree of 2025-03-07[2] has been used. The 'B-stdev' and 'A-stdev'
> columns show the ratio of the standard deviation to the average of the
> latency measurements, in percent, for 'Before' and 'After',
> respectively. 'Latency_reduction' shows the reduction of the latency
> that the series has achieved, in percent. Higher 'Latency_reduction'
> values mean larger efficiency improvements.
>
> sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
> 1          128691595.4   6.09      106878698.4   2.76      16.95
> 2          94046750.8    3.30      68691907      2.79      26.96
> 4          80538496.8    5.35      50230673.8    5.30      37.63
> 8          72672225.2    5.50      43918112      3.54      39.57
> 16         66955104.4    4.25      36549941.2    1.62      45.41
> 32         65814679      5.89      33060291      3.30      49.77
> 64         65099205.2    2.69      26003756.4    1.56      60.06
> 128        62591307.2    4.02      24008180.4    1.61      61.64
> 256        64124368.6    2.93      23868856      2.20      62.78
> 512        62325618      5.72      23311330.6    1.74      62.60
> 1024       64802138.4    5.05      23651695.2    3.40      63.50
>
> Interestingly, the latency is reduced (improved) even with batch size
> 1. I think some compiler optimizations have affected that, as also
> observed with the previous process_madvise() mmap_lock optimization
> patch series[3].
>
> So, let's focus on how the improvement scales with the batch size. As
> expected, tlb flush batching provides a latency reduction that is
> proportional to the batch size. The efficiency gain ranges from about
> 27 percent with batch size 2 up to 63 percent with batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so the
> efficiency gain on real workloads could be very different.
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
> - Clarify the motivation of the change in the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction in evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
> [3] https://lore.kernel.org/20250211182833.4193-1-sj@kernel.org
>
> SeongJae Park (9):
>   mm/madvise: use is_memory_failure() from madvise_do_behavior()
>   mm/madvise: split out populate behavior check logic
>   mm/madvise: deduplicate madvise_do_behavior() skip case handlings
>   mm/madvise: remove len parameter of madvise_do_behavior()
>   mm/madvise: define and use madvise_behavior struct for
>     madvise_do_behavior()
>   mm/memory: split non-tlb flushing part from zap_page_range_single()
>   mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
>     tlb flushes
>   mm/madvise: batch tlb flushes for
>     [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
>   mm/madvise: remove !tlb support from
>     madvise_{dontneed,free}_single_vma()
>
>  mm/internal.h |   3 +
>  mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
>  mm/memory.c   |  38 ++++++---
>  3 files changed, 176 insertions(+), 86 deletions(-)
>
>
> base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
> --
> 2.39.5
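To illustrate the batching shape the cover letter describes, a rough sketch
is below. It assumes the current tlb_gather_mmu()/tlb_finish_mmu() kernel
API; do_process_madvise_batched() and madvise_do_behavior_gather() are
made-up names for illustration, not the series' actual interface.

```c
#include <linux/mm.h>
#include <linux/uio.h>
#include <asm/tlb.h>

/* Hypothetical helper, declared here only so the sketch is self-contained:
 * it applies the behavior to one range but only records the tlb entries to
 * flush in @tlb, rather than flushing them itself. */
static int madvise_do_behavior_gather(struct mm_struct *mm,
				      struct mmu_gather *tlb,
				      unsigned long start, size_t len,
				      int behavior);

/*
 * Sketch only: one mmu_gather is set up by the process_madvise() entry path,
 * the per-range helpers merely accumulate stale tlb entries into it, and a
 * single flush happens at the end instead of one flush per address range.
 */
static ssize_t do_process_madvise_batched(struct mm_struct *mm,
					  struct iovec *iov, size_t vlen,
					  int behavior)
{
	struct mmu_gather tlb;
	ssize_t ret = 0;
	size_t i;

	tlb_gather_mmu(&tlb, mm);	/* one gather object for all ranges */

	for (i = 0; i < vlen; i++) {
		ret = madvise_do_behavior_gather(mm, &tlb,
				(unsigned long)iov[i].iov_base,
				iov[i].iov_len, behavior);
		if (ret)
			break;
	}

	tlb_finish_mmu(&tlb);		/* single batched tlb flush */
	return ret;
}
```

The point is that the expensive part of the tlb flush (which may involve
notifying other CPUs) is then paid once per process_madvise() call rather
than once per given address range, which is where the latency reduction in
the table above comes from.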
On Tue, 11 Mar 2025 12:49:38 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> The below commit message talks about tlb bits, but you spend a lot of this
> series refactoring mm/madvise.c.
>
> Can we just separate out the two into separate series, please?
>
> I am doing the same kind of thing with mremap() at the moment, but sent the
> refactor _first_, before the work that builds upon it :)
>
> Your refactoring is great, so I want to be able to take that (and Andrew's
> objections obviously don't apply there), and then we can address the tlb
> stuff separately and in a more focused way.

Thank you for the nice suggestion, and I agree. I will separate those in
the next spin.


Thanks,
SJ

[...]