Message ID | 20221102020243.522358-1-leobras@redhat.com (mailing list archive) |
---|---|
Series | Avoid scheduling cache draining to isolated cpus |
On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> Patch #1 expands housekeeping_any_cpu() so we can find housekeeping cpus
> closer (NUMA) to any desired CPU, instead of only the current CPU.
>
> ### Performance argument that motivated the change:
> There could be an argument of why would that be needed, since the current
> CPU is probably accessing the current cacheline, and so having a CPU closer
> to the current one is always the best choice since the cache invalidation
> will take less time. OTOH, there could be cases like this which use
> perCPU variables, and we can have up to 3 different CPUs touching the
> cacheline:
>
> C1 - Isolated CPU: The perCPU data 'belongs' to this one
> C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> C3 - Housekeeping CPU: This one will do the work
>
> Most of the time the cacheline is touched, it should be by C1. Sometimes
> a C2 will schedule work to run on C3, since C1 is isolated.
>
> If C1 and C2 are in different NUMA nodes, we could have C3 either in
> C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node
> (housekeeping_any_cpu_from(C1)).
>
> If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> tries to get cacheline exclusivity, and then a slower invalidation when
> this happens in C1, when it's working in its data.
>
> If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> tries to get cacheline exclusivity, and then a faster invalidation when
> this happens in C1.
>
> The thing is: it should be better to wait less when doing kernel work
> on an isolated CPU, even at the cost of some housekeeping CPU waiting
> a few more cycles.
> ###
>
> Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> local_lock to spinlocks, so it can be later used to do remote percpu
> cache draining on patch #3. Most performance concerns should be pointed
> out in the commit log.
>
> Patch #3 implements the remote per-CPU cache drain, making use of both
> patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
> introduce an extra function call and a single test to check if the CPU is
> isolated.
>
> On scenarios with isolation enabled on boot, it will also introduce an
> extra test to check in the cpumask if the CPU is isolated. If it is,
> there will also be an extra read of the cpumask to look for a
> housekeeping CPU.

This is a rather deep dive in the cache line usage but the most
important thing is really missing. Why do we want this change? From the
context it seems that this is an actual fix for isolcpus= setup when
remote (aka non isolated activity) interferes with isolated cpus by
scheduling pcp charge caches on those cpus.

Is this understanding correct? If yes, how big of a problem is that? If
you want a remote draining then you need some sort of locking (currently
we rely on local lock). How come this locking is not going to cause a
different form of disturbance?
On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > [...]

Hello Michal, thanks for reviewing!

> This is a rather deep dive in the cache line usage but the most
> important thing is really missing. Why do we want this change? From the
> context it seems that this is an actual fix for isolcpus= setup when
> remote (aka non isolated activity) interferes with isolated cpus by
> scheduling pcp charge caches on those cpus.
>
> Is this understanding correct?

That's correct! The idea is to avoid scheduling work to isolated CPUs.

> If yes, how big of a problem is that?

The use case I have been following requires both isolcpus= and PREEMPT_RT, since
the isolated CPUs will be running a real-time workload. In this scenario,
getting any work done instead of the real-time workload may cause the system to
miss a deadline, which can be bad.

> If you want a remote draining then
> you need some sort of locking (currently we rely on local lock). How
> come this locking is not going to cause a different form of disturbance?

If I did everything right, most of the extra work should be done either in
non-isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
will be happening on a housekeeping CPU, and the locking cost should be paid
there as we want to avoid doing that in the isolated CPUs.

I understand there will be a locking cost being paid in the isolated CPUs when:
a) The isolated CPU is requesting the stock drain,
b) When the isolated CPUs do a syscall and end up using the protected structure
the first time after a remote drain.

Both (a) and (b) should happen during a syscall, and IIUC an RT workload
should not expect syscalls to have a predictable time, so it should be
fine.

Thanks for helping me explain the case!

Best regards,
Leo
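[Illustration] The C1/C2/C3 argument from the cover letter quoted at the top of the thread can be sketched with a toy user-space model. This is only a sketch under stated assumptions: the Cpu record, the TOPOLOGY table and the "lowest CPU id wins" tie-break are hypothetical, and the function merely borrows the name housekeeping_any_cpu_from() from patch #1 without claiming to match its kernel implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    cpu_id: int
    node: int        # NUMA node the CPU belongs to
    isolated: bool   # True if the CPU is isolated (e.g. via isolcpus=)


def housekeeping_any_cpu_from(cpus, target_id):
    """Pick a housekeeping CPU, preferring the NUMA node of target_id.

    Toy model only: falls back to any housekeeping CPU when the target's
    node has none, and breaks ties by lowest CPU id (an assumption).
    """
    target = next(c for c in cpus if c.cpu_id == target_id)
    housekeeping = [c for c in cpus if not c.isolated]
    if not housekeeping:
        raise ValueError("no housekeeping CPU available")
    same_node = [c for c in housekeeping if c.node == target.node]
    candidates = same_node or housekeeping
    return min(c.cpu_id for c in candidates)


# Hypothetical 2-node machine: CPU 3 plays C1 (isolated), CPU 0 plays C2.
TOPOLOGY = [
    Cpu(0, node=0, isolated=False),
    Cpu(1, node=0, isolated=False),
    Cpu(2, node=1, isolated=False),
    Cpu(3, node=1, isolated=True),
]

c3_near_c2 = housekeeping_any_cpu_from(TOPOLOGY, 0)  # C3 on C2's node
c3_near_c1 = housekeeping_any_cpu_from(TOPOLOGY, 3)  # C3 on C1's node
```

With this topology, choosing relative to the scheduling CPU (C2 = CPU 0) keeps the work on node 0, while choosing relative to the isolated CPU (C1 = CPU 3) moves it to CPU 2 on node 1, which is the placement the cover letter argues for.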
On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > [...]
>
> Hello Michal, thanks for reviewing!
>
> > This is a rather deep dive in the cache line usage but the most
> > important thing is really missing. Why do we want this change? From the
> > context it seems that this is an actual fix for isolcpus= setup when
> > remote (aka non isolated activity) interferes with isolated cpus by
> > scheduling pcp charge caches on those cpus.
> >
> > Is this understanding correct?
>
> That's correct! The idea is to avoid scheduling work to isolated CPUs.
>
> > If yes, how big of a problem is that?
>
> The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> the isolated CPUs will be running a real-time workload. In this scenario,
> getting any work done instead of the real-time workload may cause the system to
> miss a deadline, which can be bad.

OK, I see. But is memcg charging actually a RT friendly operation in the
first place? Please note that this path can trigger memory reclaim and
that is when any RT expectations are simply going down the drain.

> > If you want a remote draining then
> > you need some sort of locking (currently we rely on local lock). How
> > come this locking is not going to cause a different form of disturbance?
>
> If I did everything right, most of the extra work should be done either in
> non-isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> will be happening on a housekeeping CPU, and the locking cost should be paid
> there as we want to avoid doing that in the isolated CPUs.
>
> I understand there will be a locking cost being paid in the isolated CPUs when:
> a) The isolated CPU is requesting the stock drain,
> b) When the isolated CPUs do a syscall and end up using the protected structure
> the first time after a remote drain.

And anytime the charging path (consume_stock resp. refill_stock)
contends with the remote draining which is out of control of the RT
task. It is true that the RT kernel will turn that spin lock into a
sleeping RT lock and that could help with potential priority inversions
but still quite costly thing I would expect.

> Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> should not expect syscalls to have a predictable time, so it should be
> fine.

Now I am not sure I understand. If you do not consider charging path to
be RT sensitive then why is this needed in the first place? What else
would be populating the pcp cache on the isolated cpu? IRQs?
On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > > [...]
> >
> > Hello Michal, thanks for reviewing!
> >
> > > This is a rather deep dive in the cache line usage but the most
> > > important thing is really missing. Why do we want this change? From the
> > > context it seems that this is an actual fix for isolcpus= setup when
> > > remote (aka non isolated activity) interferes with isolated cpus by
> > > scheduling pcp charge caches on those cpus.
> > >
> > > Is this understanding correct?
> >
> > That's correct! The idea is to avoid scheduling work to isolated CPUs.
> >
> > > If yes, how big of a problem is that?
> >
> > The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> > the isolated CPUs will be running a real-time workload. In this scenario,
> > getting any work done instead of the real-time workload may cause the system to
> > miss a deadline, which can be bad.
>
> OK, I see. But is memcg charging actually a RT friendly operation in the
> first place? Please note that this path can trigger memory reclaim and
> that is when any RT expectations are simply going down the drain.

I understand the time spent on charging is unpredictable as you said, since a
lot of slow stuff may or may not happen.

> > > If you want a remote draining then
> > > you need some sort of locking (currently we rely on local lock). How
> > > come this locking is not going to cause a different form of disturbance?
> >
> > If I did everything right, most of the extra work should be done either in
> > non-isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> > will be happening on a housekeeping CPU, and the locking cost should be paid
> > there as we want to avoid doing that in the isolated CPUs.

Sorry, I think this caused a misunderstanding: I meant "the pcp charge cache
drain will be happening on a housekeeping CPU, ..."

> > I understand there will be a locking cost being paid in the isolated CPUs when:
> > a) The isolated CPU is requesting the stock drain,
> > b) When the isolated CPUs do a syscall and end up using the protected structure
> > the first time after a remote drain.
>
> And anytime the charging path (consume_stock resp. refill_stock)
> contends with the remote draining which is out of control of the RT
> task. It is true that the RT kernel will turn that spin lock into a
> sleeping RT lock and that could help with potential priority inversions
> but still quite costly thing I would expect.
>
> > Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> > should not expect syscalls to have a predictable time, so it should be
> > fine.
>
> Now I am not sure I understand. If you do not consider charging path to
> be RT sensitive then why is this needed in the first place? What else
> would be populating the pcp cache on the isolated cpu? IRQs?

I am mostly trying to deal with drain_all_stock() calling schedule_work_on() on
isolated CPUs. Since the scheduled drain_local_stock() will be competing for cpu
time with the RT workload, we can have preemption of the RT workload, which is a
problem for meeting the deadlines.

One way I thought to solve that was introducing a remote drain, which would
require a different strategy for locking, since not all accesses to the pcp
caches would happen on a local CPU.

Then I tried to weigh the costs of this, so the solution would introduce as
little overhead as possible on no-isolation scenarios.

Also, for isolation scenarios, I tried to put most of the overhead into the
housekeeping CPUs, and the rest on the syscalls, which are also expected
to be non-predictable.

Not sure if I could answer your question, though. Please let me know in case I
missed anything.

Thanks for helping me make it more clear!

Best regards,
Leo
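[Illustration] The remote-drain idea described in this message can be modelled in user space as follows. This is a minimal sketch, not the patch: threading.Lock stands in for the spinlock that patch #2 proposes for stock_lock, the MemcgStock class and its fields are hypothetical, and drain_all_stock() here only records which CPUs would get work queued instead of calling anything like schedule_work_on().

```python
import threading


class MemcgStock:
    """Toy per-CPU charge cache protected by a lock any CPU may take."""

    def __init__(self):
        # Assumption: a plain mutex stands in for the proposed spinlock.
        self.lock = threading.Lock()
        self.nr_pages = 0

    def refill(self, nr_pages):
        with self.lock:
            self.nr_pages += nr_pages

    def consume(self, nr_pages):
        # Fast path of charging: succeeds only if enough pages are cached.
        with self.lock:
            if self.nr_pages >= nr_pages:
                self.nr_pages -= nr_pages
                return True
            return False

    def drain(self):
        # May be called from a CPU other than the owner (remote drain).
        with self.lock:
            drained, self.nr_pages = self.nr_pages, 0
            return drained


def drain_all_stock(stocks, isolated):
    """Drain isolated CPUs remotely; queue local work everywhere else.

    Returns (pages drained remotely, CPUs that would get local work).
    """
    remote_drained = 0
    queued = []
    for cpu, stock in stocks.items():
        if cpu in isolated:
            remote_drained += stock.drain()  # no work runs on the isolated CPU
        else:
            queued.append(cpu)               # local drain work would be queued
    return remote_drained, queued


stocks = {0: MemcgStock(), 1: MemcgStock()}
stocks[0].refill(8)
stocks[1].refill(32)
drained, queued = drain_all_stock(stocks, isolated={1})
```

In this model the isolated CPU 1 is never asked to run anything, but its consume() and refill() now share a lock with the remote drainer, which is exactly the contention concern raised earlier in the thread.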
On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
[...]
> > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > a) The isolated CPU is requesting the stock drain,
> > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > the first time after a remote drain.
> >
> > And anytime the charging path (consume_stock resp. refill_stock)
> > contends with the remote draining which is out of control of the RT
> > task. It is true that the RT kernel will turn that spin lock into a
> > sleeping RT lock and that could help with potential priority inversions
> > but still quite costly thing I would expect.
> >
> > > Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> > > should not expect syscalls to have a predictable time, so it should be
> > > fine.
> >
> > Now I am not sure I understand. If you do not consider charging path to
> > be RT sensitive then why is this needed in the first place? What else
> > would be populating the pcp cache on the isolated cpu? IRQs?
>
> I am mostly trying to deal with drain_all_stock() calling schedule_work_on() on
> isolated CPUs. Since the scheduled drain_local_stock() will be competing for cpu
> time with the RT workload, we can have preemption of the RT workload, which is a
> problem for meeting the deadlines.

Yes, this is understood. But it is not really clear to me why would any
draining be necessary for such an isolated CPU if no workload other than
the RT (which presumably doesn't charge any memory?) is running on that
CPU? Is that the RT task during the initialization phase that leaves
that cache behind or something else? Sorry for being so focused on this
but I would like to understand whether this is avoidable by a
different startup scheme or it really needs to be addressed in some way.

> One way I thought to solve that was introducing a remote drain, which would
> require a different strategy for locking, since not all accesses to the pcp
> caches would happen on a local CPU.

Yeah, I am not super happy about additional spin lock TBH. One
potential way to go would be to completely avoid pcp cache for isolated
CPUs. That would have some performance impact of course but on the other
hand it would give a more predictable behavior for those CPUs which
sounds like a reasonable compromise to me. What do you think?
On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> [...]
> > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > a) The isolated CPU is requesting the stock drain,
> > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > the first time after a remote drain.
> > >
> > > And anytime the charging path (consume_stock resp. refill_stock)
> > > contends with the remote draining which is out of control of the RT
> > > task. It is true that the RT kernel will turn that spin lock into a
> > > sleeping RT lock and that could help with potential priority inversions
> > > but still quite costly thing I would expect.
> > >
> > > > Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> > > > should not expect syscalls to have a predictable time, so it should be
> > > > fine.
> > >
> > > Now I am not sure I understand. If you do not consider charging path to
> > > be RT sensitive then why is this needed in the first place? What else
> > > would be populating the pcp cache on the isolated cpu? IRQs?
> >
> > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() on
> > isolated CPUs. Since the scheduled drain_local_stock() will be competing for cpu
> > time with the RT workload, we can have preemption of the RT workload, which is a
> > problem for meeting the deadlines.
>
> Yes, this is understood. But it is not really clear to me why would any
> draining be necessary for such an isolated CPU if no workload other than
> the RT (which presumably doesn't charge any memory?) is running on that
> CPU? Is that the RT task during the initialization phase that leaves
> that cache behind or something else?

(I am new to this part of the code, so please correct me when I miss something.)

IIUC, if a process belongs to a control group with memory control, the 'charge'
will happen when a memory page starts getting used by it.

So, if we assume an RT load on an isolated CPU will not charge any memory, we are
assuming it will never be part of a memory-controlled cgroup.

I mean, can we just assume this?

If I got that right, would not that be considered a limitation? Like
"If you don't want your workload to be interrupted by perCPU cache draining,
don't put it in a cgroup with memory control".

> Sorry for being so focused on this
> but I would like to understand whether this is avoidable by a
> different startup scheme or it really needs to be addressed in some way.

No worries, I am in fact happy you are giving it this much attention :)

I also understand this is a considerable change in the locking strategy, and
avoiding that is the first thing that should be tried.

> > One way I thought to solve that was introducing a remote drain, which would
> > require a different strategy for locking, since not all accesses to the pcp
> > caches would happen on a local CPU.
>
> Yeah, I am not super happy about additional spin lock TBH. One
> potential way to go would be to completely avoid pcp cache for isolated
> CPUs. That would have some performance impact of course but on the other
> hand it would give a more predictable behavior for those CPUs which
> sounds like a reasonable compromise to me. What do you think?

You mean not having a perCPU stock, then?
So consume_stock() for isolated CPUs would always return false, causing
try_charge_memcg() to always walk the slow path?

IIUC, both my proposal and yours would degrade performance only when we use
isolated CPUs + memcg. Is that correct?

If so, it looks like the impact would be even bigger without perCPU stock,
compared to introducing a spinlock.

Unless we are counting the case where a remote CPU is draining an isolated
CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
released in the remote CPU. Well, this seems possible to happen, but I would
have to analyze how often it would happen, and how much it would impact the
deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
before it starts, so it can avoid the faulting latency, but I need to confirm
that.

On the other hand, compared to how it works now, this should be a more
controllable way of introducing latency than a scheduled cache drain.

Your suggestion on no-stocks/caches in isolated CPUs would be great for
predictability, but I am almost sure the cost in overall performance would not
be fine.

With the possibility of prefaulting pages, do you see any scenario that would
introduce some undesirable latency in the workload?

Thanks a lot for the discussion!
Leo
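[Illustration] The alternative discussed in this message (no per-CPU stock on isolated CPUs, so consume_stock() always fails there and every charge takes the slow path) can be sketched as a toy model. Everything here is an assumption made for illustration: the PerCpuStock class, the string return values, and the batch size of 64 pages are hypothetical and do not reproduce try_charge_memcg().

```python
class PerCpuStock:
    """Toy model: charge caches exist only on non-isolated CPUs."""

    def __init__(self, isolated_cpus):
        self.isolated = set(isolated_cpus)
        self.cached = {}             # cpu -> pages held in its stock
        self.slow_path_charges = 0   # how often the slow path was taken

    def consume_stock(self, cpu, nr_pages):
        if cpu in self.isolated:
            return False             # isolated CPUs never use a stock
        if self.cached.get(cpu, 0) >= nr_pages:
            self.cached[cpu] -= nr_pages
            return True
        return False

    def try_charge(self, cpu, nr_pages, batch=64):
        """Return "fast" if served from the stock, otherwise "slow"."""
        if self.consume_stock(cpu, nr_pages):
            return "fast"
        self.slow_path_charges += 1
        if cpu not in self.isolated:
            # Charge a whole batch and keep the surplus for later charges.
            charged = max(batch, nr_pages)
            self.cached[cpu] = self.cached.get(cpu, 0) + (charged - nr_pages)
        return "slow"


stock = PerCpuStock(isolated_cpus={3})
results = [
    stock.try_charge(0, 1),  # housekeeping CPU, empty stock
    stock.try_charge(0, 1),  # housekeeping CPU, stock now populated
    stock.try_charge(3, 1),  # isolated CPU
    stock.try_charge(3, 1),  # isolated CPU again
]
```

Since the isolated CPU never holds a stock in this model, there is nothing for a remote CPU to drain and no lock to contend on; the price is that every charge from that CPU takes the slow path, which is the trade-off the two sides of the thread weigh differently.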
On Fri 04-11-22 22:45:58, Leonardo Brás wrote: > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote: > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote: > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote: > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote: > > [...] > > > > > I understand there will be a locking cost being paid in the isolated CPUs when: > > > > > a) The isolated CPU is requesting the stock drain, > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure > > > > > the first time after a remote drain. > > > > > > > > And anytime the charging path (consume_stock resp. refill_stock) > > > > contends with the remote draining which is out of control of the RT > > > > task. It is true that the RT kernel will turn that spin lock into a > > > > sleeping RT lock and that could help with potential priority inversions > > > > but still quite costly thing I would expect. > > > > > > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload > > > > > should not expect the syscalls to be have a predictable time, so it should be > > > > > fine. > > > > > > > > Now I am not sure I understand. If you do not consider charging path to > > > > be RT sensitive then why is this needed in the first place? What else > > > > would be populating the pcp cache on the isolated cpu? IRQs? > > > > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu > > > time with the RT workload, we can have preemption of the RT workload, which is a > > > problem for meeting the deadlines. > > > > Yes, this is understood. But it is not really clear to me why would any > > draining be necessary for such an isolated CPU if no workload other than > > the RT (which pressumably doesn't charge any memory?) is running on that > > CPU? 
Is that the RT task during the initialization phase that leaves > > that cache behind or something else? > > (I am new to this part of the code, so please correct me when I miss something.) > > IIUC, if a process belongs to a control group with memory control, the 'charge' > will happen when a memory page starts getting used by it. Yes, very broadly speaking. > So, if we assume a RT load in a isolated CPU will not charge any memory, we are > assuming it will never be part of a memory-controlled cgroup. If the memory cgroup controler is enabled then each user space process is a part of some memcg. If there is no specific memcg assigned then it will be a root cgroup and that is skipped during most charges except for kmem. > I mean, can we just assume this? > > If I got that right, would not that be considered a limitation? like > "If you don't want your workload to be interrupted by perCPU cache draining, > don't put it in a cgroup with memory control". We definitely do not want userspace make any assumptions on internal implementation details like caches. > > Sorry for being so focused on this > > but I would like to understand on whether this is avoidable by a > > different startup scheme or it really needs to be addressed in some way. > > No worries, I am in fact happy you are giving it this much attention :) > > I also understand this is a considerable change in the locking strategy, and > avoiding that is the first thing that should be tried. > > > > > > One way I thought to solve that was introducing a remote drain, which would > > > require a different strategy for locking, since not all accesses to the pcp > > > caches would happen on a local CPU. > > > > Yeah, I am not supper happy about additional spin lock TBH. One > > potential way to go would be to completely avoid pcp cache for isolated > > CPUs. 
That would have some performance impact of course but on the other > > hand it would give a more predictable behavior for those CPUs which > > sounds like a reasonable compromise to me. What do you think? > > You mean not having a perCPU stock, then? > So consume_stock() for isolated CPUs would always return false, causing > try_charge_memcg() always walking the slow path? Exactly. > IIUC, both my proposal and yours would degrade performance only when we use > isolated CPUs + memcg. Is that correct? Yes, with a notable difference that with your spin lock option there is still a chance that the remote draining could influence the isolated CPU workload throug that said spinlock. If there is no pcp cache for that cpu being used then there is no potential interaction at all. > If so, it looks like the impact would be even bigger without perCPU stock , > compared to introducing a spinlock. > > Unless, we are counting to this case where a remote CPU is draining an isolated > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be > released in the remote CPU. Well, this seems possible to happen, but I would > have to analyze how often would it happen, and how much would it impact the > deadlines. I *guess* most of the RT workload's memory pages are pre-faulted > before its starts, so it can avoid the faulting latency, but I need to confirm > that. Yes, that is a general practice and the reason why I was asking how real of a problem that is in practice. It is true true that appart from user space memory which can be under full control of the userspace there are kernel allocations which can be done on behalf of the process and those could be charged to memcg as well. So I can imagine the pcp cache could be populated even if the process is not faulting anything in during RT sensitive phase. > On the other hand, compared to how it works now now, this should be a more > controllable way of introducing latency than a scheduled cache drain. 
> > Your suggestion on no-stocks/caches in isolated CPUs would be great for > predictability, but I am almost sure the cost in overall performance would not > be fine. It is hard to estimate the overhead without measuring that. Do you think you can give it a try? If the performance is not really acceptable (which I would be really surprised) then we can think of a more complex solution. > With the possibility of prefaulting pages, do you see any scenario that would > introduce some undesirable latency in the workload? My primary concern would be spin lock contention which is hard to predict with something like remote draining.
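For readers following the thread, the bypass idea discussed above (consume_stock() always failing on isolated CPUs, so that try_charge_memcg() always takes the slow path) can be modelled roughly as below. This is only a user-space sketch under stated assumptions, not the kernel implementation: stock_model, cpu_isolated, CHARGE_BATCH, slow_path_calls and the *_model functions are hypothetical names made up for illustration, and the refill policy is simplified.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4
#define CHARGE_BATCH 64 /* pages charged per slow-path trip; illustrative value */

/* Hypothetical stand-in for the per-CPU stock of pre-charged pages. */
struct stock_model {
    unsigned int nr_pages; /* pages already charged and cached on this CPU */
};

static struct stock_model stock[NR_CPUS_MODEL];
static bool cpu_isolated[NR_CPUS_MODEL]; /* hypothetical isolation mask */
static unsigned long slow_path_calls;    /* trips to the shared counters */

/* Fast path: succeeds only if this CPU may use a stock and it holds enough. */
static bool consume_stock_model(int cpu, unsigned int nr_pages)
{
    if (cpu_isolated[cpu])
        return false; /* isolated CPUs never cache charges */
    if (stock[cpu].nr_pages < nr_pages)
        return false;
    stock[cpu].nr_pages -= nr_pages;
    return true;
}

/* Charge nr_pages on behalf of 'cpu', falling back to the slow path. */
static void try_charge_model(int cpu, unsigned int nr_pages)
{
    assert(nr_pages > 0);
    if (consume_stock_model(cpu, nr_pages))
        return;

    /* Slow path: this is where the shared hierarchy counters are updated. */
    slow_path_calls++;

    /* Housekeeping CPUs charge a whole batch and keep the remainder cached;
     * isolated CPUs charge exactly what they need, so nothing is left
     * behind that would later have to be drained. */
    if (!cpu_isolated[cpu] && nr_pages <= CHARGE_BATCH)
        stock[cpu].nr_pages += CHARGE_BATCH - nr_pages;
}
```

In this model a housekeeping CPU pays one slow-path trip per batch, while an isolated CPU pays one per charge but never holds a stock, so there is nothing for a remote or scheduled drain to touch.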
On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote: > On Fri 04-11-22 22:45:58, Leonardo Brás wrote: > > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote: > > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote: > > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote: > > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote: > > > [...] > > > > > > I understand there will be a locking cost being paid in the isolated CPUs when: > > > > > > a) The isolated CPU is requesting the stock drain, > > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure > > > > > > the first time after a remote drain. > > > > > > > > > > And anytime the charging path (consume_stock resp. refill_stock) > > > > > contends with the remote draining which is out of control of the RT > > > > > task. It is true that the RT kernel will turn that spin lock into a > > > > > sleeping RT lock and that could help with potential priority inversions > > > > > but still quite costly thing I would expect. > > > > > > > > > > > Both (a) and (b) should happen during a syscall, and IIUC an RT workload > > > > > > should not expect the syscalls to have a predictable time, so it should be > > > > > > fine. > > > > > > > > > > Now I am not sure I understand. If you do not consider charging path to > > > > > be RT sensitive then why is this needed in the first place? What else > > > > > would be populating the pcp cache on the isolated cpu? IRQs? > > > > > > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at > > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu > > > > time with the RT workload, we can have preemption of the RT workload, which is a > > > > problem for meeting the deadlines. > > > > > > Yes, this is understood. 
But it is not really clear to me why would any > > > draining be necessary for such an isolated CPU if no workload other than > > > the RT (which presumably doesn't charge any memory?) is running on that > > > CPU? Is that the RT task during the initialization phase that leaves > > > that cache behind or something else? > > > > (I am new to this part of the code, so please correct me when I miss something.) > > > > IIUC, if a process belongs to a control group with memory control, the 'charge' > > will happen when a memory page starts getting used by it. > > Yes, very broadly speaking. > > > So, if we assume an RT load in an isolated CPU will not charge any memory, we are > > assuming it will never be part of a memory-controlled cgroup. > > If the memory cgroup controller is enabled then each user space process > is a part of some memcg. If there is no specific memcg assigned then it > will be a root cgroup and that is skipped during most charges except for > kmem. Oh, it makes sense. Thanks for helping me understand that! > > > I mean, can we just assume this? > > > > If I got that right, would not that be considered a limitation? like > > "If you don't want your workload to be interrupted by perCPU cache draining, > > don't put it in a cgroup with memory control". > > We definitely do not want userspace to make any assumptions on internal > implementation details like caches. Perfect, that was my expectation. > > > > Sorry for being so focused on this > > > but I would like to understand whether this is avoidable by a > > > different startup scheme or it really needs to be addressed in some way. > > > > No worries, I am in fact happy you are giving it this much attention :) > > > > I also understand this is a considerable change in the locking strategy, and > > avoiding that is the first thing that should be tried. 
> > > > > > > > > One way I thought to solve that was introducing a remote drain, which would > > > > require a different strategy for locking, since not all accesses to the pcp > > > > caches would happen on a local CPU. > > > > > > Yeah, I am not super happy about additional spin lock TBH. One > > > potential way to go would be to completely avoid pcp cache for isolated > > > CPUs. That would have some performance impact of course but on the other > > > hand it would give a more predictable behavior for those CPUs which > > > sounds like a reasonable compromise to me. What do you think? > > > > You mean not having a perCPU stock, then? > > So consume_stock() for isolated CPUs would always return false, causing > > try_charge_memcg() always walking the slow path? > > Exactly. > > > IIUC, both my proposal and yours would degrade performance only when we use > > isolated CPUs + memcg. Is that correct? > > Yes, with a notable difference that with your spin lock option there is > still a chance that the remote draining could influence the isolated CPU > workload through that said spinlock. If there is no pcp cache for that > cpu being used then there is no potential interaction at all. I see. But the slow path is slow for some reason, right? Does not it make use of any locks also? So on normal operation there could be a potentially larger impact than a spinlock, even though there would be no scheduled draining. > > > If so, it looks like the impact would be even bigger without perCPU stock, > > compared to introducing a spinlock. > > > > Unless, we are counting to this case where a remote CPU is draining an isolated > > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be > > released in the remote CPU. Well, this seems possible to happen, but I would > > have to analyze how often would it happen, and how much would it impact the > > deadlines. 
I *guess* most of the RT workload's memory pages are pre-faulted > > before it starts, so it can avoid the faulting latency, but I need to confirm > > that. > > Yes, that is a general practice and the reason why I was asking how real > of a problem that is in practice. I remember this was one common factor on deadlines being missed in the workload analyzed. Need to redo the test to be sure. > It is true that apart from user > space memory which can be under full control of the userspace there are > kernel allocations which can be done on behalf of the process and those > could be charged to memcg as well. So I can imagine the pcp cache could > be populated even if the process is not faulting anything in during RT > sensitive phase. Humm, I think I will apply the change and do a comparative testing with upstream. This should bring good comparison results. > > > On the other hand, compared to how it works now, this should be a more > > controllable way of introducing latency than a scheduled cache drain. > > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for > > predictability, but I am almost sure the cost in overall performance would not > > be fine. > > It is hard to estimate the overhead without measuring that. Do you think > you can give it a try? If the performance is not really acceptable > (which I would be really surprised) then we can think of a more complex > solution. Sure, I can try that. Do you suggest any specific workload that happens to stress the percpu cache usage, with usual drains and so? Maybe I will also try with synthetic workloads. > > > With the possibility of prefaulting pages, do you see any scenario that would > > introduce some undesirable latency in the workload? > > My primary concern would be spin lock contention which is hard to > predict with something like remote draining. It makes sense. I will do some testing and come out with results for that. Thanks for reviewing! Leo
On Tue 08-11-22 20:09:25, Leonardo Brás wrote: [...] > > Yes, with a notable difference that with your spin lock option there is > > still a chance that the remote draining could influence the isolated CPU > > workload through that said spinlock. If there is no pcp cache for that > > cpu being used then there is no potential interaction at all. > > I see. > But the slow path is slow for some reason, right? > Does not it make use of any locks also? So on normal operation there could be a > potentially larger impact than a spinlock, even though there would be no > scheduled draining. Well, for the regular (try_charge) path that is essentially page_counter_try_charge which boils down to atomic_long_add_return of the memcg counter + all parents up the hierarchy and high memory limit evaluation (essentially 2 atomic_reads for the memcg + all parents up the hierarchy). That is not a whole lot - especially when the memcg hierarchy is not very deep. Per cpu batch amortizes those per hierarchy updates as well as atomic operations + cache lines bouncing on updates. On the other hand spinlock would do the unconditional atomic updates as well and even much more on CONFIG_RT. A plus is that the update will be mostly local so cache line bouncing shouldn't be terrible. Unless somebody heavily triggers pcp cache draining but this shouldn't be all that common (e.g. when a memcg triggers its limit). All that being said, I am still not convinced that the pcp cache bypass for isolated CPUs would make a dramatic difference. Especially in the context of workloads that tend to run on isolated CPUs and rarely enter kernel. > > It is true that apart from user > > space memory which can be under full control of the userspace there are > > kernel allocations which can be done on behalf of the process and those > > could be charged to memcg as well. So I can imagine the pcp cache could > > be populated even if the process is not faulting anything in during RT > > sensitive phase. 
> > Humm, I think I will apply the change and do a comparative testing with > upstream. This should bring good comparison results. That would be certainly appreciated! > > > On the other hand, compared to how it works now, this should be a more > > > controllable way of introducing latency than a scheduled cache drain. > > > > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for > > > predictability, but I am almost sure the cost in overall performance would not > > > be fine. > > > > It is hard to estimate the overhead without measuring that. Do you think > > you can give it a try? If the performance is not really acceptable > > (which I would be really surprised) then we can think of a more complex > > solution. > > Sure, I can try that. > Do you suggest any specific workload that happens to stress the percpu cache > usage, with usual drains and so? Maybe I will also try with synthetic workloads. I really think you want to test it on the isolcpu aware workload. Artificial benchmarks are not all that useful in this context.
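The slow-path cost described in this message (one atomic add on the memcg counter plus one per parent up the hierarchy) can be illustrated with a small user-space model. This is a sketch under assumptions, not the kernel's page_counter code: counter_model and counter_try_charge_model are hypothetical names, C11 atomics stand in for the kernel's atomic_long helpers, and the high-limit evaluation is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-cgroup page counter. */
struct counter_model {
    atomic_long usage;            /* pages currently charged */
    long max;                     /* hard limit for this level */
    struct counter_model *parent; /* NULL at the root of the hierarchy */
};

/*
 * Charge nr_pages to 'counter' and every ancestor. Each level costs one
 * atomic read-modify-write, so the total cost grows with hierarchy depth.
 * If any level would exceed its limit, every charge made so far is undone.
 */
static bool counter_try_charge_model(struct counter_model *counter,
                                     long nr_pages)
{
    struct counter_model *c, *failed = NULL;

    for (c = counter; c != NULL; c = c->parent) {
        long new_usage = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

        if (new_usage > c->max) {
            atomic_fetch_sub(&c->usage, nr_pages);
            failed = c;
            break;
        }
    }
    if (failed == NULL)
        return true;

    /* Roll back the levels below the one that refused the charge. */
    for (c = counter; c != failed; c = c->parent)
        atomic_fetch_sub(&c->usage, nr_pages);
    return false;
}
```

A per-CPU batch amortizes this walk: charging a batch of pages once pays the per-level atomics a single time, and later small charges are served from the local stock without touching the shared counters.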
On Wed, 2022-11-09 at 09:05 +0100, Michal Hocko wrote: > On Tue 08-11-22 20:09:25, Leonardo Brás wrote: > [...] > > > Yes, with a notable difference that with your spin lock option there is > > > still a chance that the remote draining could influence the isolated CPU > > > workload through that said spinlock. If there is no pcp cache for that > > > cpu being used then there is no potential interaction at all. > > > > I see. > > But the slow path is slow for some reason, right? > > Does not it make use of any locks also? So on normal operation there could be a > > potentially larger impact than a spinlock, even though there would be no > > scheduled draining. > > Well, for the regular (try_charge) path that is essentially page_counter_try_charge > which boils down to atomic_long_add_return of the memcg counter + all > parents up the hierarchy and high memory limit evaluation (essentially 2 > atomic_reads for the memcg + all parents up the hierarchy). That is not > a whole lot - especially when the memcg hierarchy is not very deep. > > Per cpu batch amortizes those per hierarchy updates as well as atomic > operations + cache lines bouncing on updates. > > On the other hand spinlock would do the unconditional atomic updates as > well and even much more on CONFIG_RT. A plus is that the update will be > mostly local so cache line bouncing shouldn't be terrible. Unless > somebody heavily triggers pcp cache draining but this shouldn't be all > that common (e.g. when a memcg triggers its limit). > > All that being said, I am still not convinced that the pcp cache bypass > for isolated CPUs would make a dramatic difference. Especially in the > context of workloads that tend to run on isolated CPUs and rarely enter > kernel. > > > > It is true that apart from user > > > space memory which can be under full control of the userspace there are > > > kernel allocations which can be done on behalf of the process and those > > > could be charged to memcg as well. 
So I can imagine the pcp cache could > > > be populated even if the process is not faulting anything in during RT > > > sensitive phase. > > > > Humm, I think I will apply the change and do a comparative testing with > > upstream. This should bring good comparison results. > > That would be certainly appreciated! > > > > > On the other hand, compared to how it works now, this should be a more > > > > controllable way of introducing latency than a scheduled cache drain. > > > > > > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for > > > > predictability, but I am almost sure the cost in overall performance would not > > > > be fine. > > > > > > It is hard to estimate the overhead without measuring that. Do you think > > > you can give it a try? If the performance is not really acceptable > > > (which I would be really surprised) then we can think of a more complex > > > solution. > > > > Sure, I can try that. > > Do you suggest any specific workload that happens to stress the percpu cache > > usage, with usual drains and so? Maybe I will also try with synthetic workloads. > > I really think you want to test it on the isolcpu aware workload. > Artificial benchmarks are not all that useful in this context. Hello Michal, I just sent a v2 for this patchset with a lot of changes. https://lore.kernel.org/lkml/20230125073502.743446-1-leobras@redhat.com/ I have tried to gather some data on the performance numbers as suggested, but I got carried away and the cover letter ended up too big. I hope it's not too much trouble. Best regards, Leo