Message ID | d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | mm: Increase pagevec size on large system
On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> Enlarge the pagevec size to 31 to reduce LRU lock contention for
> large systems.
>
> The LRU lock contention is reduced from 8.9% of total CPU cycles
> to 2.2% of CPU cycles. And the pmbench throughput increases
> from 88.8 Mpages/sec to 95.1 Mpages/sec.

The downside here is that pagevecs are often stored on the stack (eg
truncate_inode_pages_range()) as well as being used for the LRU list.
On a 64-bit system, this increases the stack usage from 128 to 256 bytes
for this array.

I wonder if we could do something where we transform the ones on the
stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
the ones used for the LRUs.  There's plenty of space in the header to
add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
length struct.

Or maybe our stacks are now big enough that we just don't care.
What do you think?
On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
> > Enlarge the pagevec size to 31 to reduce LRU lock contention for
> > large systems.
> >
> > The LRU lock contention is reduced from 8.9% of total CPU cycles
> > to 2.2% of CPU cycles. And the pmbench throughput increases
> > from 88.8 Mpages/sec to 95.1 Mpages/sec.
>
> The downside here is that pagevecs are often stored on the stack (eg
> truncate_inode_pages_range()) as well as being used for the LRU list.
> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
> for this array.
>
> I wonder if we could do something where we transform the ones on the
> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
> the ones used for the LRUs.  There's plenty of space in the header to
> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
> length struct.
>
> Or maybe our stacks are now big enough that we just don't care.
> What do you think?

And I wonder how useful CONFIG_NR_CPUS is for making this decision.
Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
much larger than the actual number of CPUs.

I can't think of much of a fix for this, apart from making it larger on
all kernels.  Is there a downside to this?
On 6/26/20 8:47 PM, Andrew Morton wrote:
> On Sat, 27 Jun 2020 04:13:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:
>
>> On Fri, Jun 26, 2020 at 02:23:03PM -0700, Tim Chen wrote:
>>> Enlarge the pagevec size to 31 to reduce LRU lock contention for
>>> large systems.
>>>
>>> The LRU lock contention is reduced from 8.9% of total CPU cycles
>>> to 2.2% of CPU cycles. And the pmbench throughput increases
>>> from 88.8 Mpages/sec to 95.1 Mpages/sec.
>>
>> The downside here is that pagevecs are often stored on the stack (eg
>> truncate_inode_pages_range()) as well as being used for the LRU list.
>> On a 64-bit system, this increases the stack usage from 128 to 256 bytes
>> for this array.
>>
>> I wonder if we could do something where we transform the ones on the
>> stack to DECLARE_STACK_PAGEVEC(pvec), and similarly DECLARE_LRU_PAGEVEC
>> the ones used for the LRUs.  There's plenty of space in the header to
>> add an unsigned char sz, delete PAGEVEC_SIZE and make it a variable
>> length struct.
>>
>> Or maybe our stacks are now big enough that we just don't care.
>> What do you think?
>
> And I wonder how useful CONFIG_NR_CPUS is for making this decision.
> Presumably a lot of general-purpose kernel builds have CONFIG_NR_CPUS
> much larger than the actual number of CPUs.
>
> I can't think of much of a fix for this, apart from making it larger on
> all kernels.  Is there a downside to this?

Thanks for Matthew's and Andrew's feedback.

I am okay with Matthew's suggestion of keeping the stack pagevec size
unchanged.  Andrew, do you have a preference?

I was assuming that people who really care about saving kernel memory
usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
up with a better scheme.

Otherwise, we would have to adjust the pagevec size when we actually
find out how many CPUs we have brought online.  It seems like a lot of
added complexity for going that route.

Tim
On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> I am okay with Matthew's suggestion of keeping the stack pagevec size
> unchanged.  Andrew, do you have a preference?
>
> I was assuming that people who really care about saving kernel memory
> usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
> up with a better scheme.
>
> Otherwise, we would have to adjust the pagevec size when we actually
> find out how many CPUs we have brought online.  It seems like a lot of
> added complexity for going that route.

Even if we were to do this, the worst-case stack usage on the largest
systems might be an issue.  If it isn't then we might as well hard-wire
it to 31 elements anyway, I dunno.

An extra 128 bytes of stack doesn't sound too bad, and the performance
benefit is significant.  Perhaps we just go with the original patch.

If there are any on-stack pagevecs in the page reclaim path then perhaps
we could create a new mini-pagevec for just those.  Or look at simply
removing the pagevec optimization in there altogether.
On Tue 30-06-20 17:27:13, Andrew Morton wrote:
> On Mon, 29 Jun 2020 09:57:42 -0700 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > I am okay with Matthew's suggestion of keeping the stack pagevec size
> > unchanged.  Andrew, do you have a preference?
> >
> > I was assuming that people who really care about saving kernel memory
> > usage would make CONFIG_NR_CPUS small.  I also have a hard time coming
> > up with a better scheme.
> >
> > Otherwise, we would have to adjust the pagevec size when we actually
> > find out how many CPUs we have brought online.  It seems like a lot of
> > added complexity for going that route.
>
> Even if we were to do this, the worst-case stack usage on the largest
> systems might be an issue.  If it isn't then we might as well hard-wire
> it to 31 elements anyway,

I am not sure this is really a matter of how large the machine is.  In
the writeout paths, for example, it depends much more on how complex the
IO stack is.  Direct memory reclaim is also a very stack-sensitive
context.  As we are not doing any writeout from reclaim anymore, I
believe a large part of the on-stack fs usage is not really relevant.

There seem to be only a few on-stack users inside mm and they shouldn't
be part of the memory reclaim path AFAICS.

I simply ran
$ git grep "^[[:space:]]*struct pagevec[[:space:]][^*]"
and fortunately there weren't that many hits, so it was easy to get an
idea of the usage.  There is some usage in the graphics stack that
should be double checked, though.

Btw. I think that pvec is likely a suboptimal data structure for many
on-stack users.  It allows only very few slots to batch.  Something like
mmu_gather, which can optimistically grow its batch, sounds like
something worth exploring.

The main question is whether the improvement is visible in any
non-artificial workloads.  If yes, then the quick fix is likely the best
way forward.  If this is mostly a microbenchmark thingy, then I would be
happier to see a more long-term solution - e.g. scale the pcp pagevec
sizes with the machine size, or even use something better than pvec
(lru_deactivate_file could scale much more, and I am not sure the pcp
aspect is really improving anything - why don't we simply invalidate all
gathered pages at once at the end of invalidate_mapping_pages?).
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934eda64..466ebcdd190d 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -11,8 +11,16 @@
 #include <linux/xarray.h>
 
+#if CONFIG_NR_CPUS > 64
+/*
+ * Use larger size to reduce lru lock contention on large system.
+ * 31 pointers + header align the pagevec structure to a power of two
+ */
+#define PAGEVEC_SIZE	31
+#else
 /* 15 pointers + header align the pagevec structure to a power of two */
 #define PAGEVEC_SIZE	15
+#endif
 
 struct page;
 struct address_space;
Pages to be added to an LRU are first cached in a pagevec before being
added to the LRU in a batch via pagevec_lru_move_fn.  By adding the
pages in a batch with a pagevec, the contention on the LRU lock is
mitigated.  Currently the pagevec size is defined to be 15.

We found during testing on a large SMT system with 2 sockets, 48 cores
and 96 CPU threads that the pagevec size of 15 is too small for
workloads that cause frequent page additions to the LRU.  With pmbench,
8.9% of the CPU cycles are spent contending for the LRU lock.

    12.92%  pmbench  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
            |
            --12.92%--0x5555555582f2
                      |
                      --12.92%--page_fault
                                do_page_fault
                                __do_page_fault
                                handle_mm_fault
                                __handle_mm_fault
                                |
                                |--8.90%--__lru_cache_add
                                |          pagevec_lru_move_fn
                                |          |
                                |          --8.90%--_raw_spin_lock_irqsave
                                |                    queued_spin_lock_slowpath

Enlarge the pagevec size to 31 to reduce LRU lock contention for
large systems.

The LRU lock contention is reduced from 8.9% of total CPU cycles
to 2.2% of CPU cycles.  And the pmbench throughput increases
from 88.8 Mpages/sec to 95.1 Mpages/sec.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/pagevec.h | 8 ++++++++
 1 file changed, 8 insertions(+)