Message ID | 20200414081951.297676-1-ying.huang@intel.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] autonuma: Support to scan page table asynchronously
On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> In current AutoNUMA implementation, the page tables of the processes are scanned periodically to trigger the NUMA hint page faults. The scanning runs in the context of the processes, so will delay the running of the processes. In a test with 64 threads pmbench memory accessing benchmark on a 2-socket server machine with 104 logical CPUs and 256 GB memory, there are more than 20000 latency outliers that are > 1 ms in 3600s run time. These latency outliers are almost all caused by the AutoNUMA page table scanning. Because they almost all disappear after applying this patch to scan the page tables asynchronously.
>
> Because there are idle CPUs in system, the asynchronous running page table scanning code can run on these idle CPUs instead of the CPUs the workload is running on.
>
> So on system with enough idle CPU time, it's better to scan the page tables asynchronously to take full advantages of these idle CPU time. Another scenario which can benefit from this is to scan the page tables on some service CPUs of the socket, so that the real workload can run on the isolated CPUs without the latency outliers caused by the page table scanning.
>
> But it's not perfect to scan page tables asynchronously too. For example, on system without enough idle CPU time, the CPU time isn't scheduled fairly because the page table scanning is charged to the workqueue thread instead of the process/thread it works for. And although the page tables are scanned for the target process, it may run on a CPU that is not in the cpuset of the target process.
>
> One possible solution is to let the system administrator to choose the better behavior for the system via a sysctl knob (implemented in the patch). But it's not perfect too. Because every user space knob adds maintenance overhead.
>
> A better solution may be to back-charge the CPU time to scan the page tables to the process/thread, and find a way to run the work on the proper cpuset. After some googling, I found there's some discussion about this as in the following thread,
>
> https://lkml.org/lkml/2019/6/13/1321
>
> So this patch may be not ready to be merged by upstream yet. It quantizes the latency outliers caused by the page table scanning in AutoNUMA. And it provides a possible way to resolve the issue for users who cares about it. And it is a potential customer of the work related to the cgroup-aware workqueue or other asynchronous execution mechanisms.
>

The caveats you list are the important ones and the reason why it was not done asynchronously. In an earlier implementation all the work was done by a dedicated thread and that approach was ultimately abandoned.

There is no guarantee there is an idle CPU available, let alone one that is local to the thread that should be doing the scanning. Even if there is, it potentially prevents another task from scheduling on an idle CPU, and similarly other workqueue tasks may be delayed waiting on the scanner. Hiding the cost is also problematic because the scanning CPU time is mixed with that of other unrelated workqueues. It also has the potential to mask bugs. Let's say, for example, there is a bug whereby a task is scanning excessively; that can easily be missed when the work is done by a workqueue.

While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads. For example, all threads operate on the same address space, but there can be significant overlap where all threads are potentially scanning the same areas, or regions that the thread has no interest in. One option would be to track the highest and lowest pages accessed and only scan within those regions, for example. The tricky part is that library pages may create very wide windows that render the tracking useless, but it could at least be investigated.
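To make the scan-window idea concrete, here is a minimal user-space sketch. The struct and helper names (scan_window, window_note_fault, window_clamp) are hypothetical; in the kernel the bounds would presumably live in mm_struct, be updated from the NUMA hinting fault path, and be used to clamp the range walked by task_numa_work().

/*
 * Minimal user-space sketch of "track the highest and lowest pages
 * accessed and only scan within those regions".  Names are hypothetical.
 */
#include <stdio.h>

struct scan_window {
	unsigned long lo;	/* lowest faulting address seen so far */
	unsigned long hi;	/* highest faulting address seen so far */
};

/* Record an address that took a NUMA hinting fault. */
static void window_note_fault(struct scan_window *w, unsigned long addr)
{
	if (w->lo == 0 || addr < w->lo)
		w->lo = addr;
	if (addr > w->hi)
		w->hi = addr;
}

/* Clamp a candidate scan range [start, end) to the observed window. */
static void window_clamp(const struct scan_window *w,
			 unsigned long *start, unsigned long *end)
{
	if (*start < w->lo)
		*start = w->lo;
	if (*end > w->hi)
		*end = w->hi;
}

int main(void)
{
	struct scan_window w = { 0, 0 };
	unsigned long start = 0x100000, end = 0x800000;

	/* Pretend two threads faulted on a narrow region. */
	window_note_fault(&w, 0x200000);
	window_note_fault(&w, 0x240000);

	window_clamp(&w, &start, &end);
	printf("scan [%#lx, %#lx)\n", start, end);	/* [0x200000, 0x240000) */
	return 0;
}

As noted above, a single widely mapped library can blow such a window wide open, so the bounds might need to be tracked per VMA or periodically reset to stay useful.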
Mel Gorman <mgorman@techsingularity.net> writes:

> On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
>> In current AutoNUMA implementation, the page tables of the processes are scanned periodically to trigger the NUMA hint page faults. The scanning runs in the context of the processes, so will delay the running of the processes. In a test with 64 threads pmbench memory accessing benchmark on a 2-socket server machine with 104 logical CPUs and 256 GB memory, there are more than 20000 latency outliers that are > 1 ms in 3600s run time. These latency outliers are almost all caused by the AutoNUMA page table scanning. Because they almost all disappear after applying this patch to scan the page tables asynchronously.
>>
>> Because there are idle CPUs in system, the asynchronous running page table scanning code can run on these idle CPUs instead of the CPUs the workload is running on.
>>
>> So on system with enough idle CPU time, it's better to scan the page tables asynchronously to take full advantages of these idle CPU time. Another scenario which can benefit from this is to scan the page tables on some service CPUs of the socket, so that the real workload can run on the isolated CPUs without the latency outliers caused by the page table scanning.
>>
>> But it's not perfect to scan page tables asynchronously too. For example, on system without enough idle CPU time, the CPU time isn't scheduled fairly because the page table scanning is charged to the workqueue thread instead of the process/thread it works for. And although the page tables are scanned for the target process, it may run on a CPU that is not in the cpuset of the target process.
>>
>> One possible solution is to let the system administrator to choose the better behavior for the system via a sysctl knob (implemented in the patch). But it's not perfect too. Because every user space knob adds maintenance overhead.
>>
>> A better solution may be to back-charge the CPU time to scan the page tables to the process/thread, and find a way to run the work on the proper cpuset. After some googling, I found there's some discussion about this as in the following thread,
>>
>> https://lkml.org/lkml/2019/6/13/1321
>>
>> So this patch may be not ready to be merged by upstream yet. It quantizes the latency outliers caused by the page table scanning in AutoNUMA. And it provides a possible way to resolve the issue for users who cares about it. And it is a potential customer of the work related to the cgroup-aware workqueue or other asynchronous execution mechanisms.
>>
>
> The caveats you list are the important ones and the reason why it was not done asynchronously. In an earlier implementation all the work was done by a dedicated thread and that approach was ultimately abandoned.
>
> There is no guarantee there is an idle CPU available, let alone one that is local to the thread that should be doing the scanning. Even if there is, it potentially prevents another task from scheduling on an idle CPU, and similarly other workqueue tasks may be delayed waiting on the scanner. Hiding the cost is also problematic because the scanning CPU time is mixed with that of other unrelated workqueues. It also has the potential to mask bugs. Let's say, for example, there is a bug whereby a task is scanning excessively; that can easily be missed when the work is done by a workqueue.

Do you think something like a cgroup-aware workqueue is a solution that deserves to be tried when it becomes available? It would not hide the scanning cost, because the CPU time would be charged to the original cgroup or task. Although other tasks may still be disturbed, cgroups can provide some kind of management via cpusets.

> While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads. For example, all threads operate on the same address space, but there can be significant overlap where all threads are potentially scanning the same areas, or regions that the thread has no interest in. One option would be to track the highest and lowest pages accessed and only scan within those regions, for example. The tricky part is that library pages may create very wide windows that render the tracking useless, but it could at least be investigated.

In general, I think it's good to reduce the scanning cost. But why do you think there will be overlap between the threads of a process? If my understanding is correct, the threads scan one by one instead of simultaneously.

And how would we determine whether a VMA needs to be scanned or not? For example, there may be only a small portion of pages accessed in a VMA, but they may be accessed remotely and consume quite some inter-node bandwidth, so they need to be migrated.

Best Regards,
Huang, Ying
On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads.

This; I really don't believe in those back-charging things, especially since not having cgroups, or having multiple applications in a single cgroup, is a valid setup.

Another way to reduce latency spikes is to decrease both sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size. Then you do more, smaller scans. By scanning more often you reduce the contrast, and by reducing the size you lower the max latency.

And this is all assuming you actually want numa balancing for this process.
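For a sense of scale, a small user-space sketch of the scan-size arithmetic follows; it mirrors how task_numa_work() converts sysctl_numa_balancing_scan_size (in MB) into a per-scan page budget, and the values are illustrative defaults rather than measurements.

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;			/* assume 4 KB base pages */
	unsigned int scan_size_mb[] = { 256, 64, 16 };	/* candidate sysctl values */

	for (int i = 0; i < 3; i++) {
		/* scan_size is in MB; shift by (20 - PAGE_SHIFT) to get pages */
		unsigned long pages = (unsigned long)scan_size_mb[i] << (20 - page_shift);

		printf("scan_size=%3uMB -> up to %6lu PTEs updated per scan\n",
		       scan_size_mb[i], pages);
	}
	return 0;
}

A smaller scan_size bounds how much PTE work a single invocation can do, and so bounds the worst-case stall; covering the whole address space then simply takes more of these smaller scans.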
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads.
>
> This; I really don't believe in those back-charging things, especially since not having cgroups, or having multiple applications in a single cgroup, is a valid setup.

Technically, it appears possible to back-charge the CPU time to the process/thread directly (not the cgroup).

> Another way to reduce latency spikes is to decrease both sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size. Then you do more, smaller scans. By scanning more often you reduce the contrast, and by reducing the size you lower the max latency.

Yes. This can reduce latency spikes.

Best Regards,
Huang, Ying

> And this is all assuming you actually want numa balancing for this process.
On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> I think the main idea of DAMON[1] might be able to applied here. Have you considered it?
>
> [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/

I've ignored that entire thing after you said the information it provides was already available through the PMU.
On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
>
> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> >> While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads.
> >
> > This; I really don't believe in those back-charging things, especially since not having cgroups, or having multiple applications in a single cgroup, is a valid setup.
>
> Technically, it appears possible to back-charge the CPU time to the process/thread directly (not the cgroup).

I've yet to see a sane proposal there. What we're not going to do is make regular task accounting more expensive than it already is.
On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > I think the main idea of DAMON[1] might be able to applied here. Have you considered it?
> > >
> > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> >
> > I've ignored that entire thing after you said the information it provides was already available through the PMU.
>
> Sorry if my answer made you confused. What I wanted to say was that the fundamental access checking mechanism that DAMON depends on is PTE Accessed bit for now, but it could be modified to use PMU or other features instead.

I would not be inclined to lean towards either approach for NUMA balancing. Fiddling with the accessed bit can have consequences for page aging and residency -- fine for debugging a problem, not so fine for normal usage. I would expect the PMU approach would have high overhead, as well as taking over a PMU counter that userspace debugging may expect to be available.
On Fri, Apr 17, 2020 at 01:16:29PM +0100, Mel Gorman wrote:
> On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > I think the main idea of DAMON[1] might be able to applied here. Have you considered it?
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > >
> > > I've ignored that entire thing after you said the information it provides was already available through the PMU.
> >
> > Sorry if my answer made you confused. What I wanted to say was that the fundamental access checking mechanism that DAMON depends on is PTE Accessed bit for now, but it could be modified to use PMU or other features instead.
>
> I would not be inclined to lean towards either approach for NUMA balancing. Fiddling with the accessed bit can have consequences for page aging and residency -- fine for debugging a problem, not so fine for normal usage. I would expect the PMU approach would have high overhead, as well as taking over a PMU counter that userspace debugging may expect to be available.

Oh, quite agreed; I was just saying I never saw the use of that whole DAMON thing. AFAICT it's not actually solving a problem, just making more.
On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > I think the main idea of DAMON[1] might be able to applied here. Have you considered it?
> > > > >
> > > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > >
> > > > I've ignored that entire thing after you said the information it provides was already available through the PMU.
> > >
> > > Sorry if my answer made you confused. What I wanted to say was that the fundamental access checking mechanism that DAMON depends on is PTE Accessed bit for now, but it could be modified to use PMU or other features instead.
> >
> > I would not be inclined to lean towards either approach for NUMA balancing. Fiddling with the accessed bit can have consequences for page aging and residency -- fine for debugging a problem, not so fine for normal usage. I would expect the PMU approach would have high overhead, as well as taking over a PMU counter that userspace debugging may expect to be available.
>
> Good point. But, isn't it ok to use Accessed bit as long as PG_Idle and PG_Young is adjusted properly? Current DAMON implementation does so, as idle_page_tracking also does.

PG_idle and PG_young are page flags used by idle tracking. The accessed bit I'm thinking of is in the PTE, and it's the PTE that is changed by page_referenced() during reclaim, so it's not clear how adjusting the page flags helps avoid page aging problems during reclaim. Maybe your suggestion was to move NUMA balancing to use the PG_idle and PG_young bits from idle tracking, but I'm not sure what that gains us. This may be because I did not look too closely at DAMON, as for debugging and analysis the PMU sounded more suitable.

It's not clear to me how looking at the page or PTE bit handling of DAMON helps us reduce the scanning cost of NUMA balancing. There may be some benefit in splitting the address space and rescanning sections that were referenced, but that has similar pitfalls to simply tracking the highest/lowest address that trapped a NUMA hinting fault.
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> >> While it's just an opinion, my preference would be to focus on reducing the cost and amount of scanning done -- particularly for threads.
>> >
>> > This; I really don't believe in those back-charging things, especially since not having cgroups, or having multiple applications in a single cgroup, is a valid setup.
>>
>> Technically, it appears possible to back-charge the CPU time to the process/thread directly (not the cgroup).
>
> I've yet to see a sane proposal there. What we're not going to do is make regular task accounting more expensive than it already is.

Yes, there is overhead to back-charging. To reduce the overhead, instead of back-charging immediately, we can:

- Add one field to task_struct, say backcharge_time, to track the delayed back-charged CPU time.
- When the work item completes its work, add the CPU time it spent to task_struct->backcharge_time atomically.
- When the task's CPU time is accounted regularly, e.g. in scheduler_tick(), take task_struct->backcharge_time into account too.

Although this cannot eliminate the overhead, it can reduce it. Do you think this is acceptable or not?

Best Regards,
Huang, Ying
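To illustrate, here is a user-space model of this deferred back-charge scheme using C11 atomics. The names (backcharge_ns, worker_backcharge, account_tick) are hypothetical, and the real consumer would be the kernel's regular accounting path such as scheduler_tick().

/*
 * User-space model of the deferred back-charge idea: the async scanner
 * accumulates the CPU time it spent on behalf of a task into an atomic
 * counter, and the regular accounting path drains that counter later.
 */
#include <stdatomic.h>
#include <stdio.h>

struct task {
	atomic_ulong backcharge_ns;	/* CPU time spent by workers on our behalf */
	unsigned long charged_ns;	/* time already charged to this task */
};

/* Called by the worker when it finishes a chunk of scanning for @t. */
static void worker_backcharge(struct task *t, unsigned long spent_ns)
{
	atomic_fetch_add_explicit(&t->backcharge_ns, spent_ns,
				  memory_order_relaxed);
}

/* Called from the task's regular accounting path (e.g. on a tick). */
static void account_tick(struct task *t, unsigned long tick_ns)
{
	/* Drain whatever the workers accumulated since the last tick. */
	unsigned long extra = atomic_exchange_explicit(&t->backcharge_ns, 0,
						       memory_order_relaxed);

	t->charged_ns += tick_ns + extra;
}

int main(void)
{
	struct task t = { .backcharge_ns = 0, .charged_ns = 0 };

	worker_backcharge(&t, 300000);	/* worker scanned for 0.3 ms */
	account_tick(&t, 1000000);	/* one 1 ms tick for the task itself */

	printf("charged %lu ns\n", t.charged_ns);	/* 1300000 */
	return 0;
}

The point of the exchange-and-drain is that the regular accounting path pays for at most one relaxed atomic operation per tick, regardless of how many work items completed in between.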
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..433f77436ed4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
 	u64				last_task_numa_placement;
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
+	struct work_struct		numa_async_work;
 
 	/*
 	 * This pointer is only modified for current in syscall and
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..ea595ae9e573 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_scan_async;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..f5e84ff4b788 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -798,6 +798,7 @@ void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	cancel_work_sync(&tsk->numa_async_work);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1217bfe5e81..7d82c6826708 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,9 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/* Scan asynchronously via work queue */
+unsigned int sysctl_numa_balancing_scan_async;
+
 struct numa_group {
 	refcount_t		refcount;
@@ -2481,24 +2484,17 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
-/*
- * The expensive part of numa migration is done from task_work context.
- * Triggered from task_tick_numa().
- */
-static void task_numa_work(struct callback_head *work)
+static void numa_scan(struct task_struct *p)
 {
 	unsigned long migrate, next_scan, now = jiffies;
-	struct task_struct *p = current;
+	struct task_struct *pc = current;
 	struct mm_struct *mm = p->mm;
-	u64 runtime = p->se.sum_exec_runtime;
+	u64 runtime = pc->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 
-	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
-
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -2621,12 +2617,36 @@ static void task_numa_work(struct callback_head *work)
 	 * Usually update_task_scan_period slows down scanning enough; on an
 	 * overloaded system we need to limit overhead on a per task basis.
 	 */
-	if (unlikely(p->se.sum_exec_runtime != runtime)) {
-		u64 diff = p->se.sum_exec_runtime - runtime;
+	if (unlikely(pc->se.sum_exec_runtime != runtime)) {
+		u64 diff = pc->se.sum_exec_runtime - runtime;
 		p->node_stamp += 32 * diff;
 	}
 }
 
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+static void task_numa_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+
+	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work;
+
+	numa_scan(p);
+}
+
+static void numa_async_work(struct work_struct *work)
+{
+	struct task_struct *p;
+
+	p = container_of(work, struct task_struct, numa_async_work);
+
+	numa_scan(p);
+}
+
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 	int mm_users = 0;
@@ -2650,6 +2670,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;
 
 	init_task_work(&p->numa_work, task_numa_work);
+	INIT_WORK(&p->numa_async_work, numa_async_work);
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -2699,8 +2720,14 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 		curr->numa_scan_period = task_scan_start(curr);
 		curr->node_stamp += period;
 
-		if (!time_before(jiffies, curr->mm->numa_next_scan))
-			task_work_add(curr, work, true);
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			if (sysctl_numa_balancing_scan_async)
+				queue_work_node(numa_node_id(),
+						system_unbound_wq,
+						&curr->numa_async_work);
+			else
+				task_work_add(curr, work, true);
+		}
 	}
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..31a987230c58 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -418,6 +418,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_scan_async",
+		.data		= &sysctl_numa_balancing_scan_async,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */