Message ID | 1551011649-30103-1-git-send-email-kernelfans@gmail.com (mailing list archive) |
---|---|
Series | make memblock allocator utilize the node's fallback info |
On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> There are NUMA machines with memory-less node. At present page allocator builds the
> full fallback info by build_zonelists(). But memblock allocator does not utilize
> this info. And for memory-less node, memblock allocator just falls back "node 0",
> without utilizing the nearest node. Unfortunately, the percpu section is allocated
> by memblock, which is accessed frequently after bootup.
>
> This series aims to improve the performance of per cpu section on memory-less node
> by feeding node's fallback info to memblock allocator on x86, like we do for page
> allocator. On other archs, it requires independent effort to setup node to cpumask
> map ahead.

Do you have any numbers to tell us how much does this improve the
situation?
On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > There are NUMA machines with memory-less node. At present page allocator builds the
> > full fallback info by build_zonelists(). But memblock allocator does not utilize
> > this info. And for memory-less node, memblock allocator just falls back "node 0",
> > without utilizing the nearest node. Unfortunately, the percpu section is allocated
> > by memblock, which is accessed frequently after bootup.
> >
> > This series aims to improve the performance of per cpu section on memory-less node
> > by feeding node's fallback info to memblock allocator on x86, like we do for page
> > allocator. On other archs, it requires independent effort to setup node to cpumask
> > map ahead.
>
> Do you have any numbers to tell us how much does this improve the
> situation?

Not yet. At present just based on the fact that we prefer to allocate
per cpu area on local node.

Thanks,
Pingfan
On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > There are NUMA machines with memory-less node. At present page allocator builds the
> > > full fallback info by build_zonelists(). But memblock allocator does not utilize
> > > this info. And for memory-less node, memblock allocator just falls back "node 0",
> > > without utilizing the nearest node. Unfortunately, the percpu section is allocated
> > > by memblock, which is accessed frequently after bootup.
> > >
> > > This series aims to improve the performance of per cpu section on memory-less node
> > > by feeding node's fallback info to memblock allocator on x86, like we do for page
> > > allocator. On other archs, it requires independent effort to setup node to cpumask
> > > map ahead.
> >
> > Do you have any numbers to tell us how much does this improve the
> > situation?
>
> Not yet. At present just based on the fact that we prefer to allocate
> per cpu area on local node.

Yes, we _usually_ do. But the additional complexity should be worth it.
And if we find out that the final improvement is not all that great and
considering that memory-less setups are crippled anyway then it might
turn out we just do not care all that much.
On Tue, Feb 26, 2019 at 8:09 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> > On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > > There are NUMA machines with memory-less node. At present page allocator builds the
> > > > full fallback info by build_zonelists(). But memblock allocator does not utilize
> > > > this info. And for memory-less node, memblock allocator just falls back "node 0",
> > > > without utilizing the nearest node. Unfortunately, the percpu section is allocated
> > > > by memblock, which is accessed frequently after bootup.
> > > >
> > > > This series aims to improve the performance of per cpu section on memory-less node
> > > > by feeding node's fallback info to memblock allocator on x86, like we do for page
> > > > allocator. On other archs, it requires independent effort to setup node to cpumask
> > > > map ahead.
> > >
> > > Do you have any numbers to tell us how much does this improve the
> > > situation?
> >
> > Not yet. At present just based on the fact that we prefer to allocate
> > per cpu area on local node.
>
> Yes, we _usually_ do. But the additional complexity should be worth it.
> And if we find out that the final improvement is not all that great and
> considering that memory-less setups are crippled anyway then it might
> turn out we just do not care all that much.
> --

I have run some tests on a "Dell Inc. PowerEdge R7425/02MJ3T" machine,
which has 8 NUMA nodes. The topology is:

    L1d cache:         32K
    L1i cache:         64K
    L2 cache:          512K
    L3 cache:          4096K
    NUMA node0 CPU(s): 0,8,16,24
    NUMA node1 CPU(s): 2,10,18,26
    NUMA node2 CPU(s): 4,12,20,28
    NUMA node3 CPU(s): 6,14,22,30
    NUMA node4 CPU(s): 1,9,17,25
    NUMA node5 CPU(s): 3,11,19,27
    NUMA node6 CPU(s): 5,13,21,29
    NUMA node7 CPU(s): 7,15,23,31

That is the basic information about the NUMA machine. CPUs 0 and 16
share the same L3 cache. Only nodes 1 and 5 have memory.
Using the local node as the baseline, memory write performance drops by
25% when writing to the nearest node (i.e. writing data from node 0 to
node 1), and by 78% when writing to the farthest node (i.e. writing from
node 0 to node 5).

I used a user-space test case to measure the performance difference
between the nearest node and the farthest node. The test pins two tasks
to CPUs 0 and 16. It uses two memory chunks: A, which emulates the small
footprint of the per cpu section, and B, which emulates a large
footprint. Chunk B is always allocated on the nearest node, while chunk A
switches between the nearest node and the farthest node to give
comparable results. To emulate roughly 2.5% of accesses going to the per
cpu area, the test issues writes in groups: 1 write to chunk A, then 40
writes to chunk B.

On the nearest node (chunk B), I used a 4MB footprint, which is the same
size as the L3 cache. For chunk A, I varied the footprint from
2K -> 4K -> 8K to emulate access to the per cpu section.

For 2K and 4K, the perf results cannot show the difference reliably,
because the difference is smaller than the variance.
For 8K there is a 1.8% improvement, and the larger the footprint, the
greater the improvement.

But 8K means that a module allocates 4K per cpu in the section, which
does not happen in practice. So the changes may not be needed.

Regards,
Pingfan
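
For anyone who wants to reproduce the access pattern described above, here is a minimal sketch of the write mix only (1 write to chunk A, then 40 writes to chunk B). It is not the test program used for the numbers in this mail. The names run_write_mix(), struct write_counts and share_of_a() are hypothetical, and the 64-byte stride is an assumed cache line size. CPU pinning and NUMA placement of the two chunks are left out and would have to be arranged separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* From the description: each group is 1 write to chunk A, then 40 to chunk B. */
#define B_WRITES_PER_A_WRITE 40
/* Assumption: 64-byte cache lines, so each write touches a new line. */
#define STRIDE 64

/* Hypothetical helper type: how many writes went to each chunk. */
struct write_counts {
	uint64_t to_a;
	uint64_t to_b;
};

/*
 * Issue `rounds` groups of writes. Chunk A stands in for the small
 * per cpu footprint (2K/4K/8K in the mail), chunk B for the large
 * footprint (4MB in the mail). Both sizes must be non-zero.
 */
static struct write_counts run_write_mix(volatile unsigned char *a, size_t a_size,
					 volatile unsigned char *b, size_t b_size,
					 uint64_t rounds)
{
	struct write_counts n = { 0, 0 };
	size_t ia = 0, ib = 0;

	assert(a_size > 0 && b_size > 0);

	for (uint64_t r = 0; r < rounds; r++) {
		/* One write to the small chunk, wrapping around its footprint. */
		a[ia] = (unsigned char)r;
		ia = (ia + STRIDE) % a_size;
		n.to_a++;

		/* Forty writes to the large chunk, wrapping around its footprint. */
		for (int k = 0; k < B_WRITES_PER_A_WRITE; k++) {
			b[ib] = (unsigned char)k;
			ib = (ib + STRIDE) % b_size;
			n.to_b++;
		}
	}
	return n;
}

/* Fraction of all writes that went to chunk A: 1/41, i.e. about 2.44%. */
static double share_of_a(struct write_counts n)
{
	return (double)n.to_a / (double)(n.to_a + n.to_b);
}
```

Timing this loop with chunk A placed on the nearest node and then on the farthest node would give the kind of comparison reported above. The 1:40 ratio is what yields the "around 2.5%" share of accesses to the per cpu area.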