| Message ID | 20231208025240.4744-4-gang.li@linux.dev (mailing list archive) |
|---|---|
| State | New |
| Series | hugetlb: parallelize hugetlb page init on boot |
> >  	list_for_each_entry(pw, &works, pw_list)
> > -		queue_work(system_unbound_wq, &pw->pw_work);
> > +		if (job->numa_aware)
> > +			queue_work_node((++nid % num_node_state(N_MEMORY)),

The nid may fall on a NUMA node with only memory but no CPU. In that case you
may still put the work on the unbound queue. You could end up on one CPU node
for work from all memory nodes without CPU. Is this what you want? Or would
you like to spread them between CPU nodes?

Tim

> > +					system_unbound_wq, &pw->pw_work);
> > +		else
> > +			queue_work(system_unbound_wq, &pw->pw_work);
> >
> >  	/* Use the current thread, which saves starting a workqueue worker. */
> >  	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
On 2023/12/13 07:40, Tim Chen wrote:
>
>>  	list_for_each_entry(pw, &works, pw_list)
>> -		queue_work(system_unbound_wq, &pw->pw_work);
>> +		if (job->numa_aware)
>> +			queue_work_node((++nid % num_node_state(N_MEMORY)),
>
> The nid may fall on a NUMA node with only memory but no CPU. In that case
> you may still put the work on the unbound queue. You could end up on one
> CPU node for work from all memory nodes without CPU. Is this what you
> want? Or would you like to spread them between CPU nodes?
>
> Tim

Hi, thank you for your reminder. My intention was to fully utilize all
memory bandwidth.

For memory nodes without CPUs, I also hope to be able to spread them on
different CPUs.
Hi Tim,

According to queue_work_node(), if there are no CPUs available on the given
node, it will schedule the work to any available CPU.

On 2023/12/18 14:46, Gang Li wrote:
> On 2023/12/13 07:40, Tim Chen wrote:
>>
>>>  	list_for_each_entry(pw, &works, pw_list)
>>> -		queue_work(system_unbound_wq, &pw->pw_work);
>>> +		if (job->numa_aware)
>>> +			queue_work_node((++nid % num_node_state(N_MEMORY)),
>>
>> The nid may fall on a NUMA node with only memory but no CPU. In that
>> case you may still put the work on the unbound queue. You could end up
>> on one CPU node for work from all memory nodes without CPU. Is this
>> what you want? Or would you like to spread them between CPU nodes?
>>
>> Tim
>
> Hi, thank you for your reminder. My intention was to fully utilize all
> memory bandwidth.
>
> For memory nodes without CPUs, I also hope to be able to spread them on
> different CPUs.
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..f6c58c30ed96a 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,7 @@ struct padata_shell {
  *              appropriate for one worker thread to do at once.
  * @max_threads: Max threads to use for the job, actual number may be less
  *               depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes.
  */
 struct padata_mt_job {
 	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +147,7 @@ struct padata_mt_job {
 	unsigned long align;
 	unsigned long min_chunk;
 	int max_threads;
+	bool numa_aware;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..80f82c563e46a 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks;
+	int nworks, nid;
 
 	if (job->size == 0)
 		return;
@@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	ps.chunk_size = roundup(ps.chunk_size, job->align);
 
 	list_for_each_entry(pw, &works, pw_list)
-		queue_work(system_unbound_wq, &pw->pw_work);
+		if (job->numa_aware)
+			queue_work_node((++nid % num_node_state(N_MEMORY)),
+					system_unbound_wq, &pw->pw_work);
+		else
+			queue_work(system_unbound_wq, &pw->pw_work);
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 077bfe393b5e2..1226f0c81fcb3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2234,6 +2234,7 @@ static int __init deferred_init_memmap(void *data)
 		.align = PAGES_PER_SECTION,
 		.min_chunk = PAGES_PER_SECTION,
 		.max_threads = max_threads,
+		.numa_aware = false,
 	};
 
 	padata_do_multithreaded(&job);
When a group of tasks that access different nodes are scheduled on the same
node, they may encounter bandwidth bottlenecks and access latency. Thus, a
numa_aware flag is introduced here, allowing tasks to be distributed across
different nodes to fully utilize the advantage of multi-node systems.

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 include/linux/padata.h | 2 ++
 kernel/padata.c        | 8 ++++++--
 mm/mm_init.c           | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)