Message ID | 20231030072540.38631-1-byungchul@sk.com
---|---
Series | Reduce TLB flushes under some specific conditions
On 10/30/23 00:25, Byungchul Park wrote:
> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> destination of folios participated in the migrations until all TLB
> flushes required are done, only if those folios are not mapped with
> write permission PTE entries at all. I worked Based on v6.6-rc5.

There's a lot of common overhead here, on top of the complexity in general:

 * A new page flag
 * A new cpumask_t in task_struct
 * A new zone list
 * Extra (temporary) memory consumption

and the benefits are ... "performance improved a little bit" on one
workload.  That doesn't seem like a good overall tradeoff to me.

There will certainly be workloads that, before this patch, would have
little or no memory pressure and after this patch would need to do reclaim.

Also, looking with my arch/x86 hat on, there's really nothing
arch-specific here.  Please try to keep stuff out of arch/x86 unless
it's very much arch-specific.

The connection between the arch-generic TLB flushing and
__flush_tlb_local() seems quite tenuous.  __flush_tlb_local() is, to me,
quite deep in the implementation and there are quite a few ways that a
random TLB flush might not end up there.  In other words, I'm not saying
that this is broken, but it's not clear at all to me how it functions
reliably.
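As a rough illustration of the __flush_tlb_local() concern, here is a small
user-space toy; toy_flush_tlb_local() and toy_flush_tlb_one() are made-up
stand-ins, not the x86 code. It only shows the general failure shape: if the
"all deferred flushes have completed" bookkeeping is attached to a single
flush primitive, any flush that legitimately takes a different path, e.g. a
page-granular one, never updates it.

#include <stdbool.h>
#include <stdio.h>

/* Toy bookkeeping bit: "a deferred TLB flush is still outstanding". */
static bool deferred_flush_pending = true;

/* Full local flush: the only primitive that updates the bookkeeping. */
static void toy_flush_tlb_local(void)
{
	/* ...flush the whole TLB... */
	deferred_flush_pending = false;
}

/* Page-granular flush: a perfectly valid path that bypasses the hook. */
static void toy_flush_tlb_one(unsigned long addr)
{
	(void)addr;
	/* ...flush a single entry; bookkeeping stays untouched... */
}

int main(void)
{
	toy_flush_tlb_one(0x1000);
	printf("pending after single-page flush: %d\n", deferred_flush_pending);

	toy_flush_tlb_local();
	printf("pending after full local flush:  %d\n", deferred_flush_pending);
	return 0;
}

Running the toy prints pending=1 after the single-page flush and pending=0
only after the full local flush, which is the gap being pointed at above:
correctness would then depend on every relevant flush eventually funneling
through the one hooked primitive.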
> On Oct 30, 2023, at 7:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> !! External Email
>
> On 10/30/23 00:25, Byungchul Park wrote:
>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>> destination of folios participated in the migrations until all TLB
>> flushes required are done, only if those folios are not mapped with
>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload.  That doesn't seem like a good overall tradeoff to me.

I almost forgot that I did (and embarrassingly did not follow up on) a TLB
flush deferring mechanism before [*], which was relatively generic. I did
not look at the migration case, but it could have been relatively easily
added - I think.

Feel free to plagiarize if you find it suitable. Note that some of the
patch-set is not relevant (e.g., 20/20 has already been fixed, 3/20 was
merged.)

[*] https://lore.kernel.org/linux-mm/20210131001132.3368247-1-namit@vmware.com/
On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload.  That doesn't seem like a good overall tradeoff to me.
>
> There will certainly be workloads that, before this patch, would have
> little or no memory pressure and after this patch would need to do reclaim.

'if (gain - cost) > 0?' is a difficult problem. I think the following
are already big benefits in general:

1. a big reduction in the number of IPIs
2. a big reduction in the number of TLB flushes
3. a big reduction in the number of TLB misses

Of course, I or we need to keep trying to see a better number in
end-to-end performance.

> Also, looking with my arch/x86 hat on, there's really nothing
> arch-specific here.  Please try to keep stuff out of arch/x86 unless
> it's very much arch-specific.

Okay. I will try to keep it out of arch code, though that means giving up
optimizations that can only be achieved by working in arch code.

	Byungchul
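To make the IPI and flush-count argument concrete, here is a minimal
user-space sketch of the batching idea, assuming a toy 64-CPU system;
defer_shootdown() and flush_deferred() are invented names, not the patch's
API. Each migrated page only records which CPUs may still hold stale
entries, and the accumulated CPU mask is flushed once when the batch is
closed, so a whole batch costs one flush per CPU instead of one shootdown
per page.

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 64

struct flush_batch {
	uint64_t cpumask;		/* CPUs that still need a TLB flush */
	unsigned long nr_deferred;	/* pages whose flush was deferred */
};

/* Record that @cpus had stale entries for one migrated page. */
static void defer_shootdown(struct flush_batch *b, uint64_t cpus)
{
	b->cpumask |= cpus;
	b->nr_deferred++;
}

/* Close the batch: one flush per CPU instead of one per page. */
static unsigned long flush_deferred(struct flush_batch *b)
{
	unsigned long flushes = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (b->cpumask & (1ULL << cpu))
			flushes++;	/* stand-in for one IPI + flush */

	b->cpumask = 0;
	b->nr_deferred = 0;
	return flushes;
}

int main(void)
{
	struct flush_batch b = { 0 };

	/* 1000 migrated pages, each mapped on CPUs 0-7. */
	for (int i = 0; i < 1000; i++)
		defer_shootdown(&b, 0xffULL);

	/* Unbatched: ~1000 shootdowns.  Batched: 8. */
	printf("flushes after batching: %lu\n", flush_deferred(&b));
	return 0;
}

With 1000 deferred pages all mapped on CPUs 0-7, the batched cost comes out
to 8 flush requests rather than roughly 1000, which is the shape of
reduction that items 1 and 2 above are claiming.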
On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
> On 10/30/23 00:25, Byungchul Park wrote:
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source and
> > destination of folios participated in the migrations until all TLB
> > flushes required are done, only if those folios are not mapped with
> > write permission PTE entries at all. I worked Based on v6.6-rc5.
>
> There's a lot of common overhead here, on top of the complexity in general:
>
> * A new page flag
> * A new cpumask_t in task_struct
> * A new zone list
> * Extra (temporary) memory consumption
>
> and the benefits are ... "performance improved a little bit" on one
> workload.  That doesn't seem like a good overall tradeoff to me.

I tested it under limited conditions to get stable results, e.g. not using
hyper-threading, dedicating CPU time to the test, and so on. However, I'm
convinced that this patch set is more worth developing than you think it
is. Let me share the results I've just got after changing the number of
CPUs participating in the test from 16 to 80, in a system with 80 CPUs.

This is just for your information - not that stable tho.

	Byungchul

---

Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (80 CPUs 1GB, no CPUs 8GB)
Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

mainline kernel
===============

The 1st try)
=====================================
Threads:     64
Runtime:     233.118 seconds
=====================================
numa_pages_migrated              758334
pgmigrate_success               1724964
nr_tlb_remote_flush              305706
nr_tlb_remote_flush_received   18598543
nr_tlb_local_flush_all            19092
nr_tlb_local_flush_one          4518717

The 2nd try)
=====================================
Threads:     64
Runtime:     221.725 seconds
=====================================
numa_pages_migrated              633209
pgmigrate_success               2156509
nr_tlb_remote_flush              261977
nr_tlb_remote_flush_received   14289256
nr_tlb_local_flush_all            11738
nr_tlb_local_flush_one          4520317

mainline kernel + migrc
=======================

The 1st try)
=====================================
Threads:     64
Runtime:     212.522 seconds
=====================================
numa_pages_migrated              901264
pgmigrate_success               1990814
nr_tlb_remote_flush              151280
nr_tlb_remote_flush_received    9031376
nr_tlb_local_flush_all            21208
nr_tlb_local_flush_one          4519595

The 2nd try)
=====================================
Threads:     64
Runtime:     204.410 seconds
=====================================
numa_pages_migrated              929260
pgmigrate_success               2729868
nr_tlb_remote_flush              166722
nr_tlb_remote_flush_received    8238273
nr_tlb_local_flush_all            13717
nr_tlb_local_flush_one          4519582
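Back-of-the-envelope from the two pairs of runs above: average runtime drops
from (233.118 + 221.725) / 2 ≈ 227.4 seconds to (212.522 + 204.410) / 2 ≈
208.5 seconds, roughly an 8% improvement, while nr_tlb_remote_flush falls
from about 284k to about 159k and nr_tlb_remote_flush_received from about
16.4M to about 8.6M, i.e. close to half.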
On 30.10.23 23:55, Byungchul Park wrote:
> On Mon, Oct 30, 2023 at 10:55:07AM -0700, Dave Hansen wrote:
>> On 10/30/23 00:25, Byungchul Park wrote:
>>> I'm suggesting a mechanism to reduce TLB flushes by keeping source and
>>> destination of folios participated in the migrations until all TLB
>>> flushes required are done, only if those folios are not mapped with
>>> write permission PTE entries at all. I worked Based on v6.6-rc5.
>>
>> There's a lot of common overhead here, on top of the complexity in general:
>>
>> * A new page flag
>> * A new cpumask_t in task_struct
>> * A new zone list
>> * Extra (temporary) memory consumption
>>
>> and the benefits are ... "performance improved a little bit" on one
>> workload.  That doesn't seem like a good overall tradeoff to me.
>>
>> There will certainly be workloads that, before this patch, would have
>> little or no memory pressure and after this patch would need to do reclaim.
>
> 'if (gain - cost) > 0?' is a difficult problem. I think the following
> are already big benefits in general:
>
> 1. a big reduction in the number of IPIs
> 2. a big reduction in the number of TLB flushes
> 3. a big reduction in the number of TLB misses
>
> Of course, I or we need to keep trying to see a better number in
> end-to-end performance.

You'll have to show convincing, real numbers, for use cases people care
about, to even motivate why people should consider looking at this in more
detail. If you can't measure it and only speculate, nobody cares.

The numbers you provided were so far not convincing, and it's questionable
whether the single benchmark you are presenting represents a reasonable
real workload. A better description of the whole benchmark and why it
represents real workload behavior might help.