
[0/6] make memblock allocator utilize the node's fallback info

Message ID: 1551011649-30103-1-git-send-email-kernelfans@gmail.com

Message

Pingfan Liu Feb. 24, 2019, 12:34 p.m. UTC
There are NUMA machines with memory-less nodes. At present the page allocator
builds the full fallback info via build_zonelists(), but the memblock allocator
does not utilize this info: for a memory-less node, it simply falls back to
node 0 instead of the nearest node. Unfortunately, the percpu section, which
is accessed frequently after bootup, is allocated by memblock.

This series aims to improve the performance of the percpu section on
memory-less nodes by feeding the node fallback info to the memblock allocator
on x86, as we already do for the page allocator. Other architectures require
independent effort to set up the node-to-cpumask map early enough.
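
The patches carry the real implementation; as a rough sketch of the idea (the
fallback table and helper below are illustrative names, not the exact ones
used in the series):

/*
 * Hypothetical per-node fallback table, ordered nearest to farthest,
 * filled by the same distance-sorted walk that build_zonelists()
 * already performs for the page allocator.
 */
static int node_fallback[MAX_NUMNODES][MAX_NUMNODES] __initdata;

static phys_addr_t __init memblock_find_with_fallback(phys_addr_t size,
						      phys_addr_t align,
						      int nid)
{
	phys_addr_t addr;
	int i, fnid;

	/*
	 * Walk the candidate nodes from nearest to farthest instead
	 * of jumping straight to "any node", which in practice means
	 * the lowest-numbered node with memory.
	 */
	for (i = 0; i < MAX_NUMNODES; i++) {
		fnid = node_fallback[nid][i];
		if (fnid == NUMA_NO_NODE)
			break;
		addr = memblock_find_in_range_node(size, align, 0,
					MEMBLOCK_ALLOC_ACCESSIBLE,
					fnid, MEMBLOCK_NONE);
		if (addr)
			return addr;
	}
	/* Last resort: any node, i.e. the current behavior. */
	return memblock_find_in_range_node(size, align, 0,
				MEMBLOCK_ALLOC_ACCESSIBLE,
				NUMA_NO_NODE, MEMBLOCK_NONE);
}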


CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Vlastimil Babka <vbabka@suse.cz>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Andi Kleen <ak@linux.intel.com>
CC: Petr Tesarik <ptesarik@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Nicholas Piggin <npiggin@gmail.com>
CC: Daniel Vacek <neelx@redhat.com>
CC: linux-kernel@vger.kernel.org

Pingfan Liu (6):
  mm/numa: extract the code of building node fall back list
  mm/memblock: make full utilization of numa info
  x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  x86/numa: concentrate the code of setting cpu to node map
  x86/numa: push forward the setup of node to cpumask map
  x86/numa: build node fallback info after setting up node to cpumask
    map

 arch/x86/include/asm/topology.h |  4 ---
 arch/x86/kernel/setup.c         |  2 ++
 arch/x86/kernel/setup_percpu.c  |  3 --
 arch/x86/mm/numa.c              | 40 +++++++++++-------------
 include/linux/memblock.h        |  3 ++
 mm/memblock.c                   | 68 ++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c                 | 48 +++++++++++++++++------------
 7 files changed, 114 insertions(+), 54 deletions(-)

Comments

Michal Hocko Feb. 25, 2019, 4:03 p.m. UTC | #1
On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> There are NUMA machines with memory-less nodes. At present the page allocator
> builds the full fallback info via build_zonelists(), but the memblock
> allocator does not utilize this info: for a memory-less node, it simply falls
> back to node 0 instead of the nearest node. Unfortunately, the percpu
> section, which is accessed frequently after bootup, is allocated by memblock.
> 
> This series aims to improve the performance of the percpu section on
> memory-less nodes by feeding the node fallback info to the memblock allocator
> on x86, as we already do for the page allocator. Other architectures require
> independent effort to set up the node-to-cpumask map early enough.

Do you have any numbers to tell us how much this improves the
situation?
Pingfan Liu Feb. 26, 2019, 5:47 a.m. UTC | #2
On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > There are NUMA machines with memory-less nodes. At present the page
> > allocator builds the full fallback info via build_zonelists(), but the
> > memblock allocator does not utilize this info: for a memory-less node, it
> > simply falls back to node 0 instead of the nearest node. Unfortunately,
> > the percpu section, which is accessed frequently after bootup, is
> > allocated by memblock.
> >
> > This series aims to improve the performance of the percpu section on
> > memory-less nodes by feeding the node fallback info to the memblock
> > allocator on x86, as we already do for the page allocator. Other
> > architectures require independent effort to set up the node-to-cpumask map
> > early enough.
>
> Do you have any numbers to tell us how much this improves the
> situation?

Not yet. At present this is based only on the fact that we prefer to
allocate the percpu area on the local node.

Thanks,
Pingfan
Michal Hocko Feb. 26, 2019, 12:09 p.m. UTC | #3
On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > There are NUMA machines with memory-less nodes. At present the page
> > > allocator builds the full fallback info via build_zonelists(), but the
> > > memblock allocator does not utilize this info: for a memory-less node,
> > > it simply falls back to node 0 instead of the nearest node.
> > > Unfortunately, the percpu section, which is accessed frequently after
> > > bootup, is allocated by memblock.
> > >
> > > This series aims to improve the performance of the percpu section on
> > > memory-less nodes by feeding the node fallback info to the memblock
> > > allocator on x86, as we already do for the page allocator. Other
> > > architectures require independent effort to set up the node-to-cpumask
> > > map early enough.
> >
> > Do you have any numbers to tell us how much this improves the
> > situation?
> 
> Not yet. At present this is based only on the fact that we prefer to
> allocate the percpu area on the local node.

Yes, we _usually_ do. But the additional complexity should be worth it.
And if we find out that the final improvement is not all that great, and
considering that memory-less setups are crippled anyway, then it might
turn out that we just do not care all that much.
Pingfan Liu March 5, 2019, 12:37 p.m. UTC | #4
On Tue, Feb 26, 2019 at 8:09 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> > On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > > There are NUMA machines with memory-less nodes. At present the page
> > > > allocator builds the full fallback info via build_zonelists(), but the
> > > > memblock allocator does not utilize this info: for a memory-less node,
> > > > it simply falls back to node 0 instead of the nearest node.
> > > > Unfortunately, the percpu section, which is accessed frequently after
> > > > bootup, is allocated by memblock.
> > > >
> > > > This series aims to improve the performance of the percpu section on
> > > > memory-less nodes by feeding the node fallback info to the memblock
> > > > allocator on x86, as we already do for the page allocator. Other
> > > > architectures require independent effort to set up the node-to-cpumask
> > > > map early enough.
> > >
> > > Do you have any numbers to tell us how much this improves the
> > > situation?
> >
> > Not yet. At present this is based only on the fact that we prefer to
> > allocate the percpu area on the local node.
>
> Yes, we _usually_ do. But the additional complexity should be worth it.
> And if we find out that the final improvement is not all that great, and
> considering that memory-less setups are crippled anyway, then it might
> turn out that we just do not care all that much.
> --
I have finished some tests on a "Dell Inc. PowerEdge R7425/02MJ3T"
machine, which has 8 NUMA nodes. The topology is:
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            4096K
NUMA node0 CPU(s):   0,8,16,24
NUMA node1 CPU(s):   2,10,18,26
NUMA node2 CPU(s):   4,12,20,28
NUMA node3 CPU(s):   6,14,22,30
NUMA node4 CPU(s):   1,9,17,25
NUMA node5 CPU(s):   3,11,19,27
NUMA node6 CPU(s):   5,13,21,29
NUMA node7 CPU(s):   7,15,23,31

Here is the basic info about the NUMA machine: cpu 0 and cpu 16 share the
same L3 cache, and only nodes 1 and 5 have memory. Using the local node as
the baseline, memory write performance suffers a 25% drop to the nearest
node (i.e. writing data from node 0 to node 1) and a 78% drop to the
farthest node (i.e. writing from node 0 to node 5).
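
For reference, this kind of write-bandwidth comparison can be reproduced with
a small pin-and-write program along the following lines (a sketch, not the
exact test behind the numbers above; buffer size and pass count are
arbitrary):

#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ	(256UL << 20)	/* 256MB, far beyond the 4MB L3 */

/* Pin the task to @cpu, place a buffer on @node, time write passes. */
static double write_bw(int cpu, int node)
{
	cpu_set_t set;
	struct timespec t0, t1;
	char *buf;
	int i;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	buf = numa_alloc_onnode(SZ, node);
	if (!buf)
		return 0;
	memset(buf, 0, SZ);			/* fault the pages in first */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 8; i++)
		memset(buf, i + 1, SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	numa_free(buf, SZ);
	return 8.0 * SZ / ((t1.tv_sec - t0.tv_sec) +
			   (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
	if (numa_available() < 0)
		return 1;
	/* from cpu 0: node 1 is the nearest node with memory, node 5 the farthest */
	printf("near (node 1): %.0f MB/s\n", write_bw(0, 1) / (1 << 20));
	printf("far  (node 5): %.0f MB/s\n", write_bw(0, 5) / (1 << 20));
	return 0;
}

Build with gcc -O2 -lnuma; the reported rate is bytes written per second.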

I used a user-space test case to measure the performance difference between
the nearest node and the farthest one. The case pins two tasks on cpu 0 and
cpu 16 and uses two memory chunks: chunk A, which emulates the small
footprint of the percpu section, and chunk B, which emulates a large
footprint. Chunk B is always allocated on the nearest node, while chunk A
switches between the nearest node and the farthest one to produce comparable
results. To emulate around 2.5% of accesses going to the percpu area, the
case interleaves two groups of writes: 1 pass over chunk A, then 40 passes
over chunk B.
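
The core loop looks like this, reusing the pinning and allocation scaffolding
from the sketch above (again illustrative, not the exact case):

/*
 * One task's loop (the real case runs it pinned on both cpu 0 and
 * cpu 16). Chunk B is L3-sized and lives on the nearest node with
 * memory; chunk A moves between the nearest and the farthest node
 * between runs. One pass over A per 40 passes over B gives A about
 * 1/41 ~= 2.5% of the write traffic.
 */
char *a = numa_alloc_onnode(a_size, a_node);		/* a_size: 2K/4K/8K */
char *b = numa_alloc_onnode(4UL << 20, near_node);	/* 4MB, L3-sized */

for (long iter = 0; iter < iters; iter++) {
	memset(a, iter, a_size);		/* "percpu section" traffic */
	for (int j = 0; j < 40; j++)
		memset(b, j, 4UL << 20);	/* bulk traffic, always near */
}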

For chunk B on the nearest node I used a 4MB footprint, the same size as the
L3 cache, and varied chunk A's footprint through 2K -> 4K -> 8K to emulate
different levels of access to the percpu section. For 2K and 4K, the perf
results cannot reliably tell the two placements apart, because the difference
is smaller than the variance. For 8K there is a 1.8% improvement, and the
larger the footprint, the higher the improvement. But an 8K footprint would
mean a module allocating 4K per cpu in the section, which does not happen in
practice.

So the changes may not be needed.

Regards,
Pingfan