Message ID | 20180820085516.9687-1-osalvador@techadventures.net (mailing list archive)
---|---
State | New, archived
Series | mm: Fix comment for NODEMASK_ALLOC

On Mon, 20 Aug 2018 10:55:16 +0200 Oscar Salvador <osalvador@techadventures.net> wrote:

> From: Oscar Salvador <osalvador@suse.de>
>
> Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
> NODES_SHIFT is higher than 8, otherwise it declares it within the stack.
>
> The comment says that the reasoning behind this, is that nodemask_t will be
> 256 bytes when NODES_SHIFT is higher than 8, but this is not true.
> For example, NODES_SHIFT = 9 will give us a 64 bytes nodemask_t.
> Let us fix up the comment for that.
>
> Another thing is that it might make sense to let values lower than 128bytes
> be allocated in the stack.
> Although this all depends on the depth of the stack
> (and this changes from function to function), I think that 64 bytes
> is something we can easily afford.
> So we could even bump the limit by 1 (from > 8 to > 9).

I agree. Such a change will reduce the amount of testing which the
kmalloc version receives, but I assume there are enough people out
there testing with large NODES_SHIFT values.

And while we're looking at this, it would be nice to make NODES_SHIFT
go away. Ensure that CONFIG_NODES_SHIFT always has a setting and use
that directly.
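
For the size arithmetic behind these numbers: a nodemask_t is a bitmap with
one bit per possible node (1 << NODES_SHIFT of them), rounded up to whole
unsigned longs. A standalone sketch of just that calculation (plain userspace
C, assuming 8-byte longs; not kernel code):

#include <stdio.h>

/* Bytes needed for a bitmap of (1 << nodes_shift) bits, rounded up to
 * whole unsigned longs -- the same rounding a nodemask_t gets. */
static unsigned long nodemask_bytes(unsigned int nodes_shift)
{
	unsigned long bits = 1UL << nodes_shift;
	unsigned long bits_per_long = 8 * sizeof(unsigned long);
	unsigned long longs = (bits + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

int main(void)
{
	unsigned int shift;

	/* NODES_SHIFT = 8  ->  256 nodes ->  32 bytes (fine on the stack)
	 * NODES_SHIFT = 9  ->  512 nodes ->  64 bytes
	 * NODES_SHIFT = 10 -> 1024 nodes -> 128 bytes */
	for (shift = 8; shift <= 10; shift++)
		printf("NODES_SHIFT=%u -> %lu bytes\n",
		       shift, nodemask_bytes(shift));
	return 0;
}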

On Mon 20-08-18 14:24:40, Andrew Morton wrote:
> On Mon, 20 Aug 2018 10:55:16 +0200 Oscar Salvador <osalvador@techadventures.net> wrote:
>
> > From: Oscar Salvador <osalvador@suse.de>
> >
> > Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
> > NODES_SHIFT is higher than 8, otherwise it declares it within the stack.
> >
> > The comment says that the reasoning behind this, is that nodemask_t will be
> > 256 bytes when NODES_SHIFT is higher than 8, but this is not true.
> > For example, NODES_SHIFT = 9 will give us a 64 bytes nodemask_t.
> > Let us fix up the comment for that.
> >
> > Another thing is that it might make sense to let values lower than 128bytes
> > be allocated in the stack.
> > Although this all depends on the depth of the stack
> > (and this changes from function to function), I think that 64 bytes
> > is something we can easily afford.
> > So we could even bump the limit by 1 (from > 8 to > 9).
>
> I agree. Such a change will reduce the amount of testing which the
> kmalloc version receives, but I assume there are enough people out
> there testing with large NODES_SHIFT values.

We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
time (around SLE11-SP3 AFAICS).

Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
larger than 1024 NUMA nodes? This would be 128B and from a quick glance
it seems that none of those functions are called in deep stacks. I
haven't gone through all of them but a patch which checks them all and
removes NODES_ALLOC would be quite nice IMHO.

On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> time (around SLE11-SP3 AFAICS).
>
> Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> it seems that none of those functions are called in deep stacks. I
> haven't gone through all of them but a patch which checks them all and
> removes NODES_ALLOC would be quite nice IMHO.

No, maximum we can get is 1024 NUMA nodes.
I checked this when writing another patch [1], and since having gone
through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.

NODEMASK_ALLOC gets only called from:

- unregister_mem_sect_under_nodes() (not anymore after [1])
- __nr_hugepages_store_common (This does not seem to have a deep stack,
  we could use a normal nodemask_t)

But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):

struct nodemask_scratch {
	nodemask_t mask1;
	nodemask_t mask2;
};

that would make 256 bytes in case CONFIG_NODES_SHIFT=10.

I am not familiar with mempolicy code, I am not sure if we can do without
that and figure out another way to achieve the same.

[1] https://patchwork.kernel.org/patch/10566673/#22179663
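
For context, NODEMASK_SCRATCH users follow roughly the pattern below. This is
a simplified, hypothetical caller rather than a copy of any mempolicy
function: the macro kmallocs the two-mask struct when NODES_SHIFT > 8 and
otherwise puts it on the stack, so the NULL check only matters in the kmalloc
case.

#include <linux/errno.h>
#include <linux/nodemask.h>

/* Hypothetical caller illustrating the NODEMASK_SCRATCH pattern. */
static int scratch_example(void)
{
	NODEMASK_SCRATCH(scratch);	/* struct nodemask_scratch *scratch */

	if (!scratch)			/* can only fail when kmalloc'ed */
		return -ENOMEM;

	nodes_clear(scratch->mask1);
	nodes_clear(scratch->mask2);
	/* ... use the two temporary nodemasks ... */

	NODEMASK_SCRATCH_FREE(scratch);
	return 0;
}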

On Tue 21-08-18 14:30:24, Oscar Salvador wrote:
> On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > time (around SLE11-SP3 AFAICS).
> >
> > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > it seems that none of those functions are called in deep stacks. I
> > haven't gone through all of them but a patch which checks them all and
> > removes NODES_ALLOC would be quite nice IMHO.
>
> No, maximum we can get is 1024 NUMA nodes.
> I checked this when writing another patch [1], and since having gone
> through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
>
> NODEMASK_ALLOC gets only called from:
>
> - unregister_mem_sect_under_nodes() (not anymore after [1])
> - __nr_hugepages_store_common (This does not seem to have a deep stack,
>   we could use a normal nodemask_t)
>
> But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):

mempolicy code should be a shallow stack as well. Mostly the syscall
entry.

On Tue, Aug 21, 2018 at 02:51:56PM +0200, Michal Hocko wrote:
> On Tue 21-08-18 14:30:24, Oscar Salvador wrote:
> > On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > > time (around SLE11-SP3 AFAICS).
> > >
> > > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > > it seems that none of those functions are called in deep stacks. I
> > > haven't gone through all of them but a patch which checks them all and
> > > removes NODES_ALLOC would be quite nice IMHO.
> >
> > No, maximum we can get is 1024 NUMA nodes.
> > I checked this when writing another patch [1], and since having gone
> > through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
> >
> > NODEMASK_ALLOC gets only called from:
> >
> > - unregister_mem_sect_under_nodes() (not anymore after [1])
> > - __nr_hugepages_store_common (This does not seem to have a deep stack,
> >   we could use a normal nodemask_t)
> >
> > But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
>
> mempolicy code should be a shallow stack as well. Mostly the syscall
> entry.

Ok, then I could give it a try and see if we can get rid of NODEMASK_ALLOC
in there as well.

On Tue, 21 Aug 2018 14:30:24 +0200 Oscar Salvador <osalvador@techadventures.net> wrote:

> On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > time (around SLE11-SP3 AFAICS).
> >
> > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > it seems that none of those functions are called in deep stacks. I
> > haven't gone through all of them but a patch which checks them all and
> > removes NODES_ALLOC would be quite nice IMHO.
>
> No, maximum we can get is 1024 NUMA nodes.
> I checked this when writing another patch [1], and since having gone
> through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
>
> NODEMASK_ALLOC gets only called from:
>
> - unregister_mem_sect_under_nodes() (not anymore after [1])
> - __nr_hugepages_store_common (This does not seem to have a deep stack,
>   we could use a normal nodemask_t)
>
> But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
>
> struct nodemask_scratch {
> 	nodemask_t mask1;
> 	nodemask_t mask2;
> };
>
> that would make 256 bytes in case CONFIG_NODES_SHIFT=10.

And that sole site could use an open-coded kmalloc.
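
A minimal sketch of what such an open-coded call site could look like
(illustrative only, keeping the existing two-mask struct; the function name
is made up and this is not a proposed patch):

#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

/* Hypothetical open-coded replacement for the NODEMASK_SCRATCH() pattern. */
static int scratch_example_open_coded(void)
{
	struct nodemask_scratch *scratch;

	scratch = kmalloc(sizeof(*scratch), GFP_KERNEL);
	if (!scratch)
		return -ENOMEM;

	nodes_clear(scratch->mask1);
	nodes_clear(scratch->mask2);
	/* ... same work as before, on the two temporary nodemasks ... */

	kfree(scratch);
	return 0;
}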

On Tue, Aug 21, 2018 at 01:51:59PM -0700, Andrew Morton wrote:
> On Tue, 21 Aug 2018 14:30:24 +0200 Oscar Salvador <osalvador@techadventures.net> wrote:
>
> > On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > > time (around SLE11-SP3 AFAICS).
> > >
> > > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > > it seems that none of those functions are called in deep stacks. I
> > > haven't gone through all of them but a patch which checks them all and
> > > removes NODES_ALLOC would be quite nice IMHO.
> >
> > No, maximum we can get is 1024 NUMA nodes.
> > I checked this when writing another patch [1], and since having gone
> > through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
> >
> > NODEMASK_ALLOC gets only called from:
> >
> > - unregister_mem_sect_under_nodes() (not anymore after [1])
> > - __nr_hugepages_store_common (This does not seem to have a deep stack,
> >   we could use a normal nodemask_t)
> >
> > But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
> >
> > struct nodemask_scratch {
> > 	nodemask_t mask1;
> > 	nodemask_t mask2;
> > };
> >
> > that would make 256 bytes in case CONFIG_NODES_SHIFT=10.
>
> And that sole site could use an open-coded kmalloc.

It is not really one single place, but four:

- do_set_mempolicy()
- do_mbind()
- kernel_migrate_pages()
- mpol_shared_policy_init()

They get called in:

- do_set_mempolicy()
  - From set_mempolicy syscall
  - From numa_policy_init()
  - From numa_default_policy()
  * All above do not look like they have a deep stack, so it should be
    possible to get rid of NODEMASK_SCRATCH there.

- do_mbind()
  - From mbind syscall
  * Should be feasible here as well.

- kernel_migrate_pages()
  - From migrate_pages syscall
  * Again, this should be doable.

- mpol_shared_policy_init()
  - From hugetlbfs_alloc_inode()
  - From shmem_get_inode()
  * Seems doable for hugetlbfs_alloc_inode as well.
    I only got to check hugetlbfs_alloc_inode, because shmem_get_inode
    needs a closer look.

So it seems that this can be done in most of the places.
The only tricky function might be mpol_shared_policy_init because of
shmem_get_inode.
But in that case, we could use an open-coded kmalloc there.

Thanks

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 1fbde8a880d9..5a30ad594ccc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -518,7 +518,7 @@ static inline int node_random(const nodemask_t *mask)
  * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
  * name.
  */
-#if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
+#if NODES_SHIFT > 8 /* nodemask_t > 32 bytes */
 #define NODEMASK_ALLOC(type, name, gfp_flags) \
 	type *name = kmalloc(sizeof(*name), gfp_flags)
 #define NODEMASK_FREE(m)	kfree(m)
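
For readers without the file at hand, the #else branch of that same #if
declares the object on the stack instead, roughly as follows (reproduced from
memory, so treat it as a sketch rather than a verbatim quote):

#else
/* Stack variant: a local object plus a pointer to it; freeing is a no-op. */
#define NODEMASK_ALLOC(type, name, gfp_flags)	type _##name, *name = &_##name
#define NODEMASK_FREE(m)			do {} while (0)
#endif

Either way the caller ends up with a pointer called name, and NODEMASK_FREE()
only does a real kfree() in the kmalloc branch, which is why the threshold the
patch touches is purely a question of how much stack space is acceptable.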