Message ID | d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | mm: Increase pagevec size on large system
On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> Enlarge the pagevec size to 31 to reduce LRU lock contention for
> large systems.
>
> The LRU lock contention is reduced from 8.9% of total CPU cycles
> to 2.2% of CPU cycles. And the pmbench throughput increases
> from 88.8 Mpages/sec to 95.1 Mpages/sec.

The downside here is that pagevecs are often stored on the stack (eg
truncate_inode_pages_range()) as well as being used for the LRU list.
On a 64-bit system, this increases the stack usage from 128 to 256 bytes
for this array.

I wonder if we could do something where we transform the ones on the
stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
the ones used for the LRUs.  There's plenty of space in the header to
add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
length struct.

Or maybe our stacks are now big enough that we just don't care.
What do you think?
On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> > Enlarge the pagevec size to 31 to reduce LRU lock contention for
> > large systems.
> >
> > The LRU lock contention is reduced from 8.9% of total CPU cycles
> > to 2.2% of CPU cycles. And the pmbench throughput increases
> > from 88.8 Mpages/sec to 95.1 Mpages/sec.
>
> The downside here is that pagevecs are often stored on the stack (eg
> truncate_inode_pages_range()) as well as being used for the LRU list.
> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
> for this array.
>
> I wonder if we could do something where we transform the ones on the
> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
> the ones used for the LRUs.  There's plenty of space in the header to
> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
> length struct.
>
> Or maybe our stacks are now big enough that we just don't care.
> What do you think?

And I wonder how useful CONFIG_NR_CPUS is for making this decision.
Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
much larger than the actual number of CPUs.

I can't think of much of a fix for this, apart from making it larger on
all kernels.  Is there a downside to this?
On 6/26/20 8:47 PM, Andrew Morton wrote:
> On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:
>
>> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
>>> Enlarge the pagevec size to 31 to reduce LRU lock contention for
>>> large systems.
>>>
>>> The LRU lock contention is reduced from 8.9% of total CPU cycles
>>> to 2.2% of CPU cycles. And the pmbench throughput increases
>>> from 88.8 Mpages/sec to 95.1 Mpages/sec.
>>
>> The downside here is that pagevecs are often stored on the stack (eg
>> truncate_inode_pages_range()) as well as being used for the LRU list.
>> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
>> for this array.
>>
>> I wonder if we could do something where we transform the ones on the
>> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
>> the ones used for the LRUs.  There's plenty of space in the header to
>> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
>> length struct.
>>
>> Or maybe our stacks are now big enough that we just don't care.
>> What do you think?
>
> And I wonder how useful CONFIG_NR_CPUS is for making this decision.
> Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
> much larger than the actual number of CPUs.
>
> I can't think of much of a fix for this, apart from making it larger on
> all kernels.  Is there a downside to this?

Thanks for Matthew's and Andrew's feedback.

I am okay with Matthew's suggestion of keeping the stack pagevec size
unchanged.  Andrew, do you have a preference?

I was assuming that people who really care about saving kernel memory
usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
up with a better scheme.

Otherwise, we would have to adjust the pagevec size when we actually
find out how many CPUs we have brought online.  It seems like a lot of
added complexity for going that route.

Tim
On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> I am okay with Matthew's suggestion of keeping the stack pagevec size
> unchanged.  Andrew, do you have a preference?
>
> I was assuming that people who really care about saving kernel memory
> usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
> up with a better scheme.
>
> Otherwise, we would have to adjust the pagevec size when we actually
> find out how many CPUs we have brought online.  It seems like a lot of
> added complexity for going that route.

Even if we were to do this, the worst-case stack usage on the largest
systems might be an issue.  If it isn't then we might as well hard-wire
it to 31 elements anyway, I dunno.

An extra 128 bytes of stack doesn't sound too bad, and the performance
benefit is significant.  Perhaps we just go with the original patch.

If there are any on-stack pagevecs in the page reclaim path then perhaps
we could create a new mini-pagevec for just those.  Or look at simply
removing the pagevec optimization in there altogether.
On Tue 30-06-20 17:27:13, Andrew Morton wrote:
> On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > I am okay with Matthew's suggestion of keeping the stack pagevec size
> > unchanged.  Andrew, do you have a preference?
> >
> > I was assuming that people who really care about saving kernel memory
> > usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
> > up with a better scheme.
> >
> > Otherwise, we would have to adjust the pagevec size when we actually
> > find out how many CPUs we have brought online.  It seems like a lot of
> > added complexity for going that route.
>
> Even if we were to do this, the worst-case stack usage on the largest
> systems might be an issue.  If it isn't then we might as well hard-wire
> it to 31 elements anyway,

I am not sure this is really a matter of how large the machine is.  In
the writeout paths, for example, it depends much more on how complex the
IO stack is.  Direct memory reclaim is also a very stack-sensitive
context.  As we are not doing any writeout from reclaim anymore, I
believe a large part of the on-stack fs usage is not really relevant.

There seem to be only a few on-stack users inside mm and they shouldn't
be part of the memory reclaim path AFAICS.

I simply ran
$ git grep "^[[:space:]]*struct pagevec[[:space:]][^*]"
and fortunately there weren't that many hits, so it was easy to get an
idea of the usage.  There is some usage in the graphics stack that
should be double checked, though.

Btw. I think that pvec is likely a suboptimal data structure for many
on-stack users.  It allows only very few slots to batch.  Something like
mmu_gather, which can optimistically grow its batch, sounds like
something worth exploring.

The main question is whether the improvement is visible in any
non-artificial workloads.  If yes, then the quick fix is likely the best
way forward.  If this is mostly a microbenchmark thingy, then I would be
happier to see a more long-term solution - e.g. scale the pcp pagevec
sizes with the machine size, or even use something better than pvec
(lru_deactivate_file could scale much more, and I am not sure the pcp
aspect is really improving anything - why don't we simply invalidate all
gathered pages at once at the end of invalidate_mapping_pages?).
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..466ebcdd190d 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -11,8 +11,16 @@
 #include <linux/xarray.h>
 
+#if CONFIG_NR_CPUS > 64
+/*
+ * Use larger size to reduce lru lock contention on large system.
+ * 31 pointers + header align the pagevec structure to a power of two
+ */
+#define PAGEVEC_SIZE	31
+#else
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE	15
+#endif
 
 struct page;
 struct address_space;
Pages to be added to an LRU are first cached in a pagevec before being
added to the LRU in a batch via pagevec_lru_move_fn.  By adding the
pages in a batch with a pagevec, the contention on the LRU lock is
mitigated.  Currently the pagevec size is defined to be 15.

We found during testing on a large SMT system with 2 sockets, 48 cores
and 96 CPU threads that the pagevec size of 15 is too small for
workloads that cause frequent page additions to the LRU.  With pmbench,
8.9% of the CPU cycles are spent contending for the LRU lock.

    12.92%  pmbench  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
            |
            --12.92%--0x5555555582f2
                      |
                      --12.92%--page_fault
                                do_page_fault
                                __do_page_fault
                                handle_mm_fault
                                __handle_mm_fault
                                |
                                |--8.90%--__lru_cache_add
                                |          pagevec_lru_move_fn
                                |          |
                                |          --8.90%--_raw_spin_lock_irqsave
                                |                    queued_spin_lock_slowpath

Enlarge the pagevec size to 31 to reduce LRU lock contention for
large systems.

The LRU lock contention is reduced from 8.9% of total CPU cycles
to 2.2% of CPU cycles.  And the pmbench throughput increases
from 88.8 Mpages/sec to 95.1 Mpages/sec.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/pagevec.h | 8 ++++++++
 1 file changed, 8 insertions(+)