
[v2,0/5] Introduce memcg_stock_pcp remote draining

Message ID 20230125073502.743446-1-leobras@redhat.com (mailing list archive)


Leonardo Bras Jan. 25, 2023, 7:34 a.m. UTC
Disclaimer:
a - The cover letter got bigger than expected, so I had to split it into
    sections to better organize myself. I am not very comfortable with it.
b - Performance numbers below did not include patch 5/5 (Remove flags
    from memcg_stock_pcp), which could further improve performance for
    drain_all_stock(), but I could only notice the optimization at the
    last minute.


0 - Motivation:
On the current codebase, when drain_all_stock() is run, it schedules a
drain_local_stock() on each cpu that has a percpu stock associated with a
descendant of the given root_memcg.

This happens even on 'isolated cpus', a feature commonly used by workloads that
are sensitive to interruptions and context switching, such as vRAN and
Industrial Control Systems.

Since this scheduling behavior is a problem for those workloads, the proposal is
to replace the current local_lock + schedule_work_on() solution with a per-cpu
spinlock.

1 - Possible performance losses:
With few schedule_work_on() calls, local_locks are expected to perform better
than spinlocks, since the cacheline is always accessed by a single CPU and
there is no contention. The impact on those areas is analyzed below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remote-reads the percpu
memcg_stock cacheline of every online CPU, and remote-writes to all cpus that
pass the mem_cgroup_is_descendant() test. (stock->flags, stock->work)

With spinlocks, drain_all_stock() still remote-reads the percpu memcg_stock
cacheline of every online CPU, and remote-writes to all cpus that pass the
mem_cgroup_is_descendant() test. (stock->stock_lock, on top of the above)
While the spinlock may require extra acquire/release writes on some archs, they
all happen on an already-exclusive cacheline, so there is not much overhead.

In both cases, the next local cpu read will require a fetch from memory (as the
line was invalidated by the remote write), and the cacheline must be made
exclusive again before the local write.

So about cacheline usage, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() does not run often. If that were
not the case, there would be a lot of scheduled work items, which would hurt
cpu performance.

Since it does not run often, and it is the only function that accesses remote
percpu memcg_stock, contention should not be significant, and should cause
less impact than scheduling the work (remote) plus running it (local).


2 - Performance test results:
2.1 - Test System:
	- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed 
	in 8 NUMA nodes:  0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
	- 256GB RAM, 2x32GB per NUMA node
	- For cpu isolation: use kernel cmdline:
	isolcpus=8-15,72-79 nohz_full=8-15,72-79
	- Memcg group created with:
		[Slice]
		MemoryAccounting=true
		MemoryLimit=1G
		MemoryMax=1G
		MemorySwapMax=1M
	- For pinning runs on a given CPU, I used 'taskset -c $cpunum command'
	- For restarting the memcg, the following was used:
	restart_memcg(){
	  systemctl stop test.slice
	  systemctl daemon-reload
	  systemctl start test.slice
	}

2.2 - Impact on functions that use memcg_stock:
	2.2.1 - Approach: perf and tracepoints give very coarse results, so
	it was preferred to count the total cpu clocks between entering and
	exiting the functions that use the memcg_stock, on an isolated cpu.
	Something like this was used on x86:

	fun_name(){
		u64 clk = rdtsc_ordered();
		... <function does its job>
		clk = rdtsc_ordered() - clk;
		<percpu struct dbg>
		dbg->fun_name.cycles += clk;
		dbg->fun_name.count++;
	}
	
	The percpu statistics were then acquired via a char device after the 
	test finished, and an average function clock usage was calculated.

	For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
	of 100000 iterations inside the memcg, for each cpu tested:

	for cpu in $cpuset; do
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
		for k in {1..100000} ; do
		  cat /proc/interrupts > /dev/null;
		done"
	done

	For the drain_all_stock() test, it was necessary to restart the memcg
	(or cause an OOM) to call the function.
	
	2.2.2 - Results
	For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls.
	STDDEV is the standard deviation between the averages of 6 runs,
	and Call Count is the sum of calls over the 6 runs:
	
	Patched			Average clk	STDDEV		Call Count
	consume_stock:		63.75983475	0.1610502136	72167768
	refill_stock:		67.45708322	0.09732816852	23401756
	mod_objcg_state:	98.03841384	1.491628532	181292961
	consume_obj_stock:	63.2543456	0.04624513799	94846454
	refill_obj_stock:	78.56390025	0.3160306174	91732973
			
	Upstream		Average clk	STDDEV		Call Count
	consume_stock:		53.51201046	0.05200824438	71866536
	refill_stock:		65.46991584	0.1178078417	23401752
	mod_objcg_state:	84.95365055	1.371464414	181292707
	consume_obj_stock:	60.03619438	0.05944582207	94846327
	refill_obj_stock:	73.23757912	1.161933856	91732787
			
	Patched - Upstream	Diff (cycles)	Diff %	
	consume_stock:		10.24782429	19.15051258	
	refill_stock:		1.987167378	3.035237411	
	mod_objcg_state:	13.08476328	15.40223781	
	consume_obj_stock:	3.218151217	5.360351785	
	refill_obj_stock:	5.326321123	7.272661368	
	
	So on average the above patched functions are 2~13 clock cycles slower
	than upstream.
	
	On the other hand, drain_all_stock() is faster in the patched version,
	even considering it does all the draining itself instead of scheduling
	the work to other CPUs:
	
	drain_all_stock
	cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
	1	44331.10831	38978.03581	-5353.072507	-12.07520567
	8	43992.96512	39026.76654	-4966.198572	-11.2886198
	128	156274.6634	58053.87421	-98220.78915	-62.85138425
	
	(8 cpus being in the same NUMA node)
		
2.3 - Contention numbers
	2.3.1 - Approach
	On top of the patched version, I replaced the spin_lock_irqsave() in
	functions that use the memcg_stock with spin_lock_irqsave_cc(), which
	is defined as:
	
	#define spin_lock_irqsave_cc(l, flags)				\
		if (!spin_trylock_irqsave(l, flags)) {			\
			u64 clk = rdtsc_ordered();			\
			spin_lock_irqsave(l, flags);			\
			clk = rdtsc_ordered() - clk;			\
			pr_err("mydebug: cpu %d hit contention :"	\
			       " fun = %s, clk = %lld\n",		\
			       smp_processor_id(), __func__, clk);	\
		}

	So in case of contention (trylock failing), it records an approximate
	clk count spent before getting the lock, and prints this to dmesg.

	For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
	of 100000 iterations for each of the 128 cpus inside the memcg
	(limit set to 20G):

	for cpu in {1..128}; do
		restart_memcg
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
		for k in {1..100000} ; do
		  cat /proc/interrupts > /dev/null;
		done"
	done

	This loop was repeated for over 8 hours.
	
	2.3.2- Results
	
	Function		# calls
	consume_stock:		15078323802
	refill_stock:		2495995683
	mod_objcg_state:	39765559905
	consume_obj_stock:	19882888224
	refill_obj_stock:	21025241793
	drain_all_stock:	592
	
	Contentions hit: 0
	
	(Other, more aggressive synthetic tests were run, and even then
	contention was hit just a couple of times over several hours.)
	
2.4 - Syscall time measure
	2.4.1- Approach
	To measure the patchset effect on syscall time, the following code was
	used: (copied/adapted from
 https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/ )
 
 	################
 	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <string.h>

	typedef unsigned long long cycles_t;
	typedef unsigned long long usecs_t;
	typedef unsigned long long u64;

	#ifdef __x86_64__
	#define DECLARE_ARGS(val, low, high)    unsigned long low, high
	#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
	#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
	#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
	#else
	#define DECLARE_ARGS(val, low, high)    unsigned long long val
	#define EAX_EDX_VAL(val, low, high)     (val)
	#define EAX_EDX_ARGS(val, low, high)    "A" (val)
	#define EAX_EDX_RET(val, low, high)     "=A" (val)
	#endif

	static inline unsigned long long __rdtscll(void)
	{
		DECLARE_ARGS(val, low, high);

		asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));

		return EAX_EDX_VAL(val, low, high);
	}

	#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

	#define NRSYSCALLS 30000000
	#define NRSLEEPS   100000

	#define page_mmap()     mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
	#define page_munmap(x)  munmap(x, 4096)

	int main(int argc, char *argv[])
	{
		unsigned long a, b, cycles;
		int i, syscall = 0;
		int *page;

		page = page_mmap();
		if (page == MAP_FAILED)
		        perror("mmap");

		if (page_munmap(page))
		        perror("munmap");

		if (argc != 2) {
		        printf("usage: %s {idle,syscall}\n", argv[0]);
		        exit(1);
		}

		rdtscll(a);

		if (strncmp("idle", argv[1], 4) == 0)
		        syscall = 0;
		else if (strncmp("syscall", argv[1], 7) == 0)
		        syscall = 1;
		else {
		      	printf("usage: %s {idle,syscall}\n", argv[0]);
		        exit(1);
		}

		if (syscall == 1) {
		        for (i = 0; i < NRSYSCALLS; i++) {
		                page = page_mmap();
		                if (page == MAP_FAILED)
		                        perror("mmap");

				#ifdef MY_WRITE
		                page[3] = i;
		                #endif

		                if (page_munmap(page))
		                        perror("munmap");
		        }
		} else {
		        for (i = 0; i < NRSLEEPS; i++)
		                usleep(10);
		}

		rdtscll(b);

		cycles = b - a;

		if (syscall == 1)
		        printf("cycles / syscall: %lu\n", cycles / (NRSYSCALLS * 2));
		else
		        printf("cycles / idle loop: %lu\n", cycles / NRSLEEPS);
	}
	################

	Running it with ./my_test syscall will print the average clock cycle
	usage of the syscall pair (page_mmap() and page_munmap()).
	
	It was compiled in two versions: with -DMY_WRITE and without it.
	The difference is that the former writes to the allocated page,
	causing it to fault.

	Each version was run 200 times, pinned to an isolated cpu, and an
	average and standard deviation were calculated from those results.

	2.4.2- Results
	Patched: no_write	write
	AVG	2991.195	5746.495
	STDEV	27.77488427	40.55878512
	STDEV %	0.9285547838	0.7058004073	
	
	Upstream: no_write	write
	AVG	3012.22		5749.605
	STDEV	25.1186451	37.26206223
	STDEV %	0.8338914522	0.6480803851

	Pat - Up: no_write	write
	Diff	-21.025		-3.11
	Diff %	-0.6979901866	-0.05409067232
	
	Meaning the pair page_mmap() + page_munmap() + pagefault runs a tiny bit
	faster on the patched version, compared to upstream.
	
3 - Discussion on results
	In 2.2 we see that every function using memcg_stock on the local cpu
	gets slower by some cycles in the patched version, while the function
	that accesses it remotely (drain_all_stock()) gets faster. The
	difference is more accentuated as we raise the cpu count, and
	consequently start dealing with memory shared across NUMA nodes.
	
	In 2.3 we see that contention is not a big issue, as expected in 1.2.
	This probably happens because drain_all_stock() runs quite rarely
	in normal operation, and the other functions are quite fast.
	
	In 2.4 we can see that page_mmap() + page_munmap() + pagefault ran
	a tiny bit faster. This is probably because drain_all_stock() no
	longer schedules work on the running cpu, so the test does not get
	interrupted, possibly making up for the increased time in the local
	functions.
	
4- Conclusion
	Scheduling work on isolated cpus can be an issue for some workloads.
	Reducing the issue by replacing the local_lock in memcg_stock with a
	spinlock should not cause much impact on performance.
	
Leonardo Bras (5):
  mm/memcontrol: Align percpu memcg_stock to cache
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
  mm/memcontrol: Perform all stock drain in current CPU
  mm/memcontrol: Remove flags from memcg_stock_pcp

 mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 44 deletions(-)

Comments

Michal Hocko Jan. 25, 2023, 8:33 a.m. UTC | #1
On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> Disclaimer:
> a - The cover letter got bigger than expected, so I had to split it in
>     sections to better organize myself. I am not very confortable with it.
> b - Performance numbers below did not include patch 5/5 (Remove flags
>     from memcg_stock_pcp), which could further improve performance for
>     drain_all_stock(), but I could only notice the optimization at the
>     last minute.
> 
> 
> 0 - Motivation:
> On current codebase, when drain_all_stock() is ran, it will schedule a
> drain_local_stock() for each cpu that has a percpu stock associated with a
> descendant of a given root_memcg.
> 
> This happens even on 'isolated cpus', a feature commonly used on workloads that
> are sensitive to interruption and context switching such as vRAN and Industrial
> Control Systems.
> 
> Since this scheduling behavior is a problem to those workloads, the proposal is
> to replace the current local_lock + schedule_work_on() solution with a per-cpu
> spinlock.

If IIRC we have also discussed that isolated CPUs can simply opt out
from the pcp caching and therefore the problem would be avoided
altogether without changes to the locking scheme. I do not see anything
regarding that in this submission. Could you elaborate why you have
abandoned this option?
Leonardo Bras Jan. 25, 2023, 11:06 a.m. UTC | #2
On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > Disclaimer:
> > a - The cover letter got bigger than expected, so I had to split it in
> >     sections to better organize myself. I am not very confortable with it.
> > b - Performance numbers below did not include patch 5/5 (Remove flags
> >     from memcg_stock_pcp), which could further improve performance for
> >     drain_all_stock(), but I could only notice the optimization at the
> >     last minute.
> > 
> > 
> > 0 - Motivation:
> > On current codebase, when drain_all_stock() is ran, it will schedule a
> > drain_local_stock() for each cpu that has a percpu stock associated with a
> > descendant of a given root_memcg.
> > 
> > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > are sensitive to interruption and context switching such as vRAN and Industrial
> > Control Systems.
> > 
> > Since this scheduling behavior is a problem to those workloads, the proposal is
> > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > spinlock.
> 
> If IIRC we have also discussed that isolated CPUs can simply opt out
> from the pcp caching and therefore the problem would be avoided
> altogether without changes to the locking scheme. I do not see anything
> regarding that in this submission. Could you elaborate why you have
> abandoned this option?

Hello Michal,

I understand pcp caching is a nice to have.
So while I kept the idea of disabling pcp caching in mind as an option, I first
tried to understand what kind of impacts we would be seeing when trying to
change the locking scheme.

After I raised the data in the cover letter, I found that the performance impact
appears not be that big. So in order to try keeping the pcp cache on isolated
cpus active, I decided to focus effort on the locking scheme change.

I mean, my rationale is: if there is an inexpensive way of keeping the feature,
why should we abandon it?

Best regards,
Leo
Michal Hocko Jan. 25, 2023, 11:39 a.m. UTC | #3
On Wed 25-01-23 08:06:46, Leonardo Brás wrote:
> On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > Disclaimer:
> > > a - The cover letter got bigger than expected, so I had to split it in
> > >     sections to better organize myself. I am not very confortable with it.
> > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > >     from memcg_stock_pcp), which could further improve performance for
> > >     drain_all_stock(), but I could only notice the optimization at the
> > >     last minute.
> > > 
> > > 
> > > 0 - Motivation:
> > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > descendant of a given root_memcg.
> > > 
> > > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > > are sensitive to interruption and context switching such as vRAN and Industrial
> > > Control Systems.
> > > 
> > > Since this scheduling behavior is a problem to those workloads, the proposal is
> > > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > > spinlock.
> > 
> > If IIRC we have also discussed that isolated CPUs can simply opt out
> > from the pcp caching and therefore the problem would be avoided
> > altogether without changes to the locking scheme. I do not see anything
> > regarding that in this submission. Could you elaborate why you have
> > abandoned this option?
> 
> Hello Michal,
> 
> I understand pcp caching is a nice to have.
> So while I kept the idea of disabling pcp caching in mind as an option, I first
> tried to understand what kind of impacts we would be seeing when trying to
> change the locking scheme.
> 
> After I raised the data in the cover letter, I found that the performance impact
> appears not be that big. So in order to try keeping the pcp cache on isolated
> cpus active, I decided to focus effort on the locking scheme change.
> 
> I mean, my rationale is: if is there a non-expensive way of keeping the feature,
> why should we abandon it?

Because any locking to pcp adds a potential contention. You haven't been
able to trigger that contention by your testing but that doesn't really
mean it is non-existent.

Besided that opt-out for isolated cpus should be a much smaller change
with a much more predictable behavior. The overall performance for
charging on isolated cpus will be slightly worse but it hasn't been shown
that this is anything really noticeable. Most workloads which do care about
isolcpus tend to not enter kernel much AFAIK so this should have
relatively small impact.

All that being said I would prefer a simpler solution which seems to be
to simply opt out from pcp caching and if the performance for real
world workloads shows regressions then we can start thinking about a
more complex solution.
Marcelo Tosatti Jan. 25, 2023, 6:22 p.m. UTC | #4
On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > Disclaimer:
> > > a - The cover letter got bigger than expected, so I had to split it in
> > >     sections to better organize myself. I am not very confortable with it.
> > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > >     from memcg_stock_pcp), which could further improve performance for
> > >     drain_all_stock(), but I could only notice the optimization at the
> > >     last minute.
> > > 
> > > 
> > > 0 - Motivation:
> > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > descendant of a given root_memcg.
> > > 
> > > This happens even on 'isolated cpus', a feature commonly used on workloads that
> > > are sensitive to interruption and context switching such as vRAN and Industrial
> > > Control Systems.
> > > 
> > > Since this scheduling behavior is a problem to those workloads, the proposal is
> > > to replace the current local_lock + schedule_work_on() solution with a per-cpu
> > > spinlock.
> > 
> > If IIRC we have also discussed that isolated CPUs can simply opt out
> > from the pcp caching and therefore the problem would be avoided
> > altogether without changes to the locking scheme. I do not see anything
> > regarding that in this submission. Could you elaborate why you have
> > abandoned this option?
> 
> Hello Michal,
> 
> I understand pcp caching is a nice to have.
> So while I kept the idea of disabling pcp caching in mind as an option, I first
> tried to understand what kind of impacts we would be seeing when trying to
> change the locking scheme.

Remote draining reduces interruptions whether CPU 
is marked as isolated or not:

- Allows isolated CPUs to benefit from pcp caching.
- Removes the interruption to non isolated CPUs. See for example 

https://lkml.org/lkml/2022/6/13/2769

"Minchan Kim tested this independently and reported;

       My workload is not NOHZ CPUs but run apps under heavy memory
       pressure so they goes to direct reclaim and be stuck on
       drain_all_pages until work on workqueue run.

       unit: nanosecond
       max(dur)        avg(dur)                count(dur)
       166713013       487511.77786438033      1283

       From traces, system encountered the drain_all_pages 1283 times and
       worst case was 166ms and avg was 487us.

       The other problem was alloc_contig_range in CMA. The PCP draining
       takes several hundred millisecond sometimes though there is no
       memory pressure or a few of pages to be migrated out but CPU were
       fully booked.

       Your patch perfectly removed those wasted time."


> After I raised the data in the cover letter, I found that the performance impact
> appears not be that big. So in order to try keeping the pcp cache on isolated
> cpus active, I decided to focus effort on the locking scheme change.
> 
> I mean, my rationale is: if is there a non-expensive way of keeping the feature,
> why should we abandon it?
> 
> Best regards,
> Leo
> 
Roman Gushchin Jan. 25, 2023, 11:14 p.m. UTC | #5
On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > Disclaimer:
> > > > a - The cover letter got bigger than expected, so I had to split it in
> > > >     sections to better organize myself. I am not very confortable with it.
> > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > >     from memcg_stock_pcp), which could further improve performance for
> > > >     drain_all_stock(), but I could only notice the optimization at the
> > > >     last minute.
> > > > 
> > > > 
> > > > 0 - Motivation:
> > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > descendant of a given root_memcg.

Do you know what caused those drain_all_stock() calls? I wonder if we should look
into why we have many of them and whether we really need them?

It's either some user's actions (e.g. reducing memory.max), or some memcg
is entering pre-oom conditions. In the latter case a lot of drain calls can be
scheduled without a good reason (assuming the cgroup contains multiple tasks running
on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota
and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
on per-cgroup basis.

Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
charges/memcg references (it might be not completely idle, but running some very special
workload which is not doing any kernel allocations or a process belonging to the root memcg).
In all other cases pcpu stock will be either drained naturally by an allocation from another
memcg or an allocation from the same memcg will "restore" it, making draining useless.

We can also call into drain_all_pages() opportunistically, without waiting for the result.
On a busy system it's most likely useless: we might oom before the scheduled works are executed.

I admit I planned to do some work around this and even started, but then never had
enough time to finish it.

Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
doing this without a really strong reason.

Thanks!
Hillf Danton Jan. 26, 2023, 2:01 a.m. UTC | #6
On Wed, 25 Jan 2023 15:22:00 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> 
> Remote draining reduces interruptions whether CPU 
> is marked as isolated or not:
> 
> - Allows isolated CPUs from benefiting of pcp caching.
> - Removes the interruption to non isolated CPUs. See for example 

Why ask refill to take a pill because drain got a cough?
> 
> https://lkml.org/lkml/2022/6/13/2769
> 
> "Minchan Kim tested this independently and reported;
> 
>        My workload is not NOHZ CPUs but run apps under heavy memory
>        pressure so they goes to direct reclaim and be stuck on
>        drain_all_pages until work on workqueue run.

What sense are you trying to make by getting CPUs isolated and equipped
with tight memory?
> 
>        unit: nanosecond
>        max(dur)        avg(dur)                count(dur)
>        166713013       487511.77786438033      1283
> 
>        From traces, system encountered the drain_all_pages 1283 times and
>        worst case was 166ms and avg was 487us.
> 
>        The other problem was alloc_contig_range in CMA. The PCP draining
>        takes several hundred millisecond sometimes though there is no
>        memory pressure or a few of pages to be migrated out but CPU were
>        fully booked.
> 
>        Your patch perfectly removed those wasted time."
Michal Hocko Jan. 26, 2023, 7:41 a.m. UTC | #7
On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > Disclaimer:
> > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > >     sections to better organize myself. I am not very confortable with it.
> > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > >     from memcg_stock_pcp), which could further improve performance for
> > > > >     drain_all_stock(), but I could only notice the optimization at the
> > > > >     last minute.
> > > > > 
> > > > > 
> > > > > 0 - Motivation:
> > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > descendant of a given root_memcg.
> 
> Do you know what caused those drain_all_stock() calls? I wonder if we should look
> into why we have many of them and whether we really need them?
> 
> It's either some user's actions (e.g. reducing memory.max), either some memcg
> is entering pre-oom conditions. In the latter case a lot of drain calls can be
> scheduled without a good reason (assuming the cgroup contain multiple tasks running
> on multiple cpus).

I believe I've never got a specific answer to that. We
have discussed that in the previous version submission
(20221102020243.522358-1-leobras@redhat.com and specifically
Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
isolcpus. I was wondering about using memcgs in RT workloads because
that just sounds weird but let's say this is the case indeed. Then an RT
task or whatever task that is running on an isolated cpu can have pcp
charges.

> Essentially each cpu will try to grab the remains of the memory quota
> and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> on per-cgroup basis.

I think it would be more than sufficient to disable pcp charging on an
isolated cpu. This is not a per memcg property. I can imagine that
different tasks running in the same memcg can run on a mix of CPUs (e.g.
only part of it on isolated CPUs). It is a recipe for all sorts of
priority inversions but well, memcg and RT is there already.

> Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> charges/memcg references (it might be not completely idle, but running some very special
> workload which is not doing any kernel allocations or a process belonging to the root memcg).
> In all other cases pcpu stock will be either drained naturally by an allocation from another
> memcg or an allocation from the same memcg will "restore" it, making draining useless.
> 
> We also can into drain_all_pages() opportunistically, without waiting for the result.
> On a busy system it's most likely useless, we might oom before scheduled works will be executed.

I think the primary objective is that no execution unintended by userspace
happens on isolated cpus.
 
> I admit I planned to do some work around and even started, but then never had enough time to
> finish it.
> 
> Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> doing this without a really strong reason.

Are you OK with a simple opt out on isolated CPUs? That would make
charges slightly slower (atomic on the hierarchy counters vs. a single
pcp adjustment) but it would guarantee that the isolated workload is
predictable which is the primary objective AFAICS.
Michal Hocko Jan. 26, 2023, 7:45 a.m. UTC | #8
On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
[...]
> Remote draining reduces interruptions whether CPU 
> is marked as isolated or not:
> 
> - Allows isolated CPUs from benefiting of pcp caching.
> - Removes the interruption to non isolated CPUs. See for example 
> 
> https://lkml.org/lkml/2022/6/13/2769

This is talking about the page allocator per cpu caches, right? In this patch
we are talking about memcg pcp caches. Are you sure the same applies
here?
Marcelo Tosatti Jan. 26, 2023, 6:03 p.m. UTC | #9
On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > Disclaimer:
> > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > >     sections to better organize myself. I am not very confortable with it.
> > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > >     from memcg_stock_pcp), which could further improve performance for
> > > > > >     drain_all_stock(), but I could only notice the optimization at the
> > > > > >     last minute.
> > > > > > 
> > > > > > 
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> > 
> > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > into why we have many of them and whether we really need them?
> > 
> > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > on multiple cpus).
> 
> I believe I've never got a specific answer to that. We
> have discussed that in the previous version submission
> (20221102020243.522358-1-leobras@redhat.com and specifically
> Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> isolcpus. I was wondering about using memcgs in RT workloads because
> that just sounds weird but let's say this is the case indeed. 

This could be the case. You can consider an "edge device" where it is
necessary to run an RT workload. It might also be useful to run
non-realtime applications on the same system.

> Then an RT task or whatever task that is running on an isolated
> cpu can have pcp charges.

Usually the RT task (or more specifically the realtime-sensitive loop
of the application) runs entirely in userspace. But I suppose there
could be charges on application startup.

> > Essentially each cpu will try to grab the remains of the memory quota
> > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > on per-cgroup basis.
> 
> I think it would be more than sufficient to disable pcp charging on an
> isolated cpu. This is not a per memcg property. I can imagine that
> different tasks running in the same memcg can run on a mix of CPUs (e.g.
> only part of it on isolated CPUs). It is a recipe for all sorts of
> priority inversions but well, memcg and RT is there already.

I suppose the more general the solution, the better.

> > Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
> > charges/memcg references (it might be not completely idle, but running some very special
> > workload which is not doing any kernel allocations or a process belonging to the root memcg).
> > In all other cases pcpu stock will be either drained naturally by an allocation from another
> > memcg or an allocation from the same memcg will "restore" it, making draining useless.
> > 
> > We also can into drain_all_pages() opportunistically, without waiting for the result.
> > On a busy system it's most likely useless, we might oom before scheduled works will be executed.
> 
> I think the primary objective is that no userspace unintended execution
> happens on isolated cpus.

No interruptions to the userspace code (time sensitive code) running on
isolated CPUs: no IPIs, no task switches.

> > I admit I planned to do some work around and even started, but then never had enough time to
> > finish it.
> > 
> > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > doing this without a really strong reason.
> 
> Are you OK with a simple opt out on isolated CPUs? That would make
> charges slightly slower (atomic on the hierarchy counters vs. a single
> pcp adjustment) but it would guarantee that the isolated workload is
> predictable which is the primary objective AFAICS.

This would make isolated CPUs "second-class citizens": it would be nice
to be able to execute non-realtime apps on isolated CPUs as well
(think of different periods of time during a day, one where
more realtime apps are required, another where fewer
realtime apps are required).

Concrete example: think of a computer handling vRAN traffic near a
cell tower. The traffic (and therefore the amount of processing required
by realtime applications) might vary during the day.

User might want to run containers that depend on good memcg charging
performance on isolated CPUs.
Marcelo Tosatti Jan. 26, 2023, 6:14 p.m. UTC | #10
On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> [...]
> > Remote draining reduces interruptions whether CPU 
> > is marked as isolated or not:
> > 
> > - Allows isolated CPUs from benefiting of pcp caching.
> > - Removes the interruption to non isolated CPUs. See for example 
> > 
> > https://lkml.org/lkml/2022/6/13/2769
> 
> This is talking about page allocato per cpu caches, right? In this patch
> we are talking about memcg pcp caches. Are you sure the same applies
> here?

Both can stall the users of the drain operation.

"Minchan Kim tested this independently and reported;

	My workload is not NOHZ CPUs but run apps under heavy memory
	pressure so they goes to direct reclaim and be stuck on
	drain_all_pages until work on workqueue run."

Therefore using a workqueue to drain memcg pcps also depends on the
remote CPU executing that work item in time (which can stall the
charge paths listed below). No?

===

   7   3141  mm/memory.c <<wp_page_copy>>
             if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL))
   8   4118  mm/memory.c <<do_anonymous_page>>
             if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
   9   4577  mm/memory.c <<do_cow_fault>>
             if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm,
  10    621  mm/migrate_device.c <<migrate_vma_insert_page>>
             if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
  11    710  mm/shmem.c <<shmem_add_to_page_cache>>
             error = mem_cgroup_charge(folio, charge_mm, gfp);
Marcelo Tosatti Jan. 26, 2023, 6:19 p.m. UTC | #11
On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote:
> On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > [...]
> > > > >
> > > > > 0 - Motivation:
> > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > descendant of a given root_memcg.
> 
> [...]
>
> Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> doing this without a really strong reason.

The expectation would be that cache locking should not cause slowdown of
the allocation and free paths:

https://manualsbrain.com/en/manuals/1246877/?page=313

For the P6 and more recent processor families, if the area of memory being locked
during a LOCK operation is cached in the processor that is performing the LOCK
operation as write-back memory and is completely contained in a cache line, the
processor may not assert the LOCK# signal on the bus. Instead, it will modify the
memory location internally and allow its cache coherency mechanism to ensure that
the operation is carried out atomically. This operation is called “cache locking.” The
cache coherency mechanism automatically prevents two or more processors that ...
Michal Hocko Jan. 26, 2023, 7:13 p.m. UTC | #12
On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > [...]
> > > Remote draining reduces interruptions whether CPU 
> > > is marked as isolated or not:
> > > 
> > > - Allows isolated CPUs from benefiting of pcp caching.
> > > - Removes the interruption to non isolated CPUs. See for example 
> > > 
> > > https://lkml.org/lkml/2022/6/13/2769
> > 
> > This is talking about page allocato per cpu caches, right? In this patch
> > we are talking about memcg pcp caches. Are you sure the same applies
> > here?
> 
> Both can stall the users of the drain operation.

Yes. But it is important to consider who those users are. We are
draining when
	- we are charging and the limit is hit so that memory reclaim
	  has to be triggered.
	- hard, high limits are set and require memory reclaim.
	- force_empty - full memory reclaim for a memcg
	- memcg offlining - cgroup removal - quite a heavy operation as
	  well.
All those could be really costly kernel operations, and they affect an
isolated cpu only if the same memcg is used by both isolated and non-isolated
cpus. In other words, those costly operations would have to be triggered
from non-isolated cpus, and those are expected to be stalled. It is
the side effect of the scheduled local cpu draining that affects
the isolated cpu as well.

Is that more clear?
Michal Hocko Jan. 26, 2023, 7:20 p.m. UTC | #13
On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > [...]
> > > > > > >
> > > > > > > 0 - Motivation:
> > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > descendant of a given root_memcg.
> > > 
> > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > into why we have many of them and whether we really need them?
> > > 
> > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > on multiple cpus).
> > 
> > I believe I've never got a specific answer to that. We
> > have discussed that in the previous version submission
> > (20221102020243.522358-1-leobras@redhat.com and specifically
> > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> > isolcpus. I was wondering about using memcgs in RT workloads because
> > that just sounds weird but let's say this is the case indeed. 
> 
> This could be the case. You can consider an "edge device" where it is
> necessary to run a RT workload. It might also be useful to run 
> non realtime applications on the same system.
> 
> > Then an RT task or whatever task that is running on an isolated
> > cpu can have pcp charges.
> 
> Usually the RT task (or more specifically the realtime sensitive loop
> of the application) runs entirely on userspace. But i suppose there
> could be charges on application startup.

What is the role of memcg then? If the memory limit is in place and the
workload doesn't fit into it, then it will get reclaimed during start up and
memory would need to be refaulted if not mlocked. If it is mlocked then
the limit cannot be enforced and the start up would likely fail as a
result of the memcg oom killer.

[...]
> > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > doing this without a really strong reason.
> > 
> > Are you OK with a simple opt out on isolated CPUs? That would make
> > charges slightly slower (atomic on the hierarchy counters vs. a single
> > pcp adjustment) but it would guarantee that the isolated workload is
> > predictable which is the primary objective AFAICS.
> 
> This would make isolated CPUs "second class citizens": it would be nice
> to be able to execute non realtime apps on isolated CPUs as well
> (think of different periods of time during a day, one where 
> more realtime apps are required, another where less 
> realtime apps are required).

An alternative would require making the current lockless implementation
use locks, introducing potential lock contention. This could be harmful
to regular workloads. Not using pcp caching would make the fast path use
a few atomics rather than a local pcp update. That is not a terrible
cost to pay for special-cased workloads which use isolcpus. Really we
are not talking about a massive cost to be paid. At least nobody has
shown that in any numbers.

> Concrete example: think of a computer handling vRAN traffic near a 
> cell tower. The traffic (therefore amount of processing required
> by realtime applications) might vary during the day.
> 
> User might want to run containers that depend on good memcg charging
> performance on isolated CPUs.

What kind of role would memcg play here? Do you want to memory constrain
that workload?
Roman Gushchin Jan. 26, 2023, 11:12 p.m. UTC | #14
On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > [...]
> > > > > >
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> > 
> [...]
> 
> > Essentially each cpu will try to grab the remains of the memory quota
> > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > on per-cgroup basis.
> 
> I think it would be more than sufficient to disable pcp charging on an
> isolated cpu.

It might have significant performance consequences.

I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
the accuracy of memory limits and slightly increase the memory footprint (all
those dying memcgs...), but the impact will be limited. Actually it is limited
by the number of cpus.

> This is not a per memcg property.

Sure, my point was that in a pre-oom condition several cpus might try to consolidate
the remains of the memory quota, actually working against each other. That is a separate
issue, which might be a reason why there are many flush attempts in the case we discuss.

Thanks!
Marcelo Tosatti Jan. 27, 2023, 12:32 a.m. UTC | #15
On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > 0 - Motivation:
> > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > descendant of a given root_memcg.
> > > > 
> > [...]
> > 
> > Usually the RT task (or more specifically the realtime sensitive loop
> > of the application) runs entirely on userspace. But i suppose there
> > could be charges on application startup.
> 
> What is the role of memcg then? If the memory limit is in place and the
> workload doesn't fit in then it will get reclaimed during start up and
> memory would need to be refaulted if not mlocked. If it is mlocked then
> the limit cannot be enforced and the start up would likely fail as a
> result of the memcg oom killer.

1) An application which is not time sensitive executes on an isolated CPU,
with memcg control enabled. A per-CPU stock is created.

2) The app with memcg control enabled exits; the per-CPU stock is not drained.

3) A latency-sensitive application starts; the isolated CPU has stock to
be drained, and:

/*
 * Drains all per-CPU charge caches for given root_memcg resp. subtree
 * of the hierarchy under it.
 */
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu, curcpu;

        /* If someone's already draining, avoid adding running more workers. */
        if (!mutex_trylock(&percpu_charge_mutex))
                return;
        /*
         * Notify other cpus that system-wide "drain" is running
         * We do not care about races with the cpu hotplug because cpu down
         * as well as workers from this path always operate on the local
         * per-cpu data. CPU up doesn't touch memcg_stock at all.
         */
        migrate_disable();
        curcpu = smp_processor_id();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                struct mem_cgroup *memcg;
                bool flush = false;

                rcu_read_lock();
                memcg = stock->cached;
                if (memcg && stock->nr_pages &&
                    mem_cgroup_is_descendant(memcg, root_memcg))
                        flush = true;
                else if (obj_stock_flush_required(stock, root_memcg))
                        flush = true;
                rcu_read_unlock();

                if (flush &&
                    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
                        if (cpu == curcpu)
                                drain_local_stock(&stock->work);
                        else
                                schedule_work_on(cpu, &stock->work);
                }
        }
        migrate_enable();
        mutex_unlock(&percpu_charge_mutex);
}

> [...]
> > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > > doing this without a really strong reason.
> > > 
> > > Are you OK with a simple opt out on isolated CPUs? That would make
> > > charges slightly slower (atomic on the hierarchy counters vs. a single
> > > pcp adjustment) but it would guarantee that the isolated workload is
> > > predictable which is the primary objective AFAICS.
> > 
> > This would make isolated CPUs "second class citizens": it would be nice
> > to be able to execute non realtime apps on isolated CPUs as well
> > (think of different periods of time during a day, one where 
> > more realtime apps are required, another where less 
> > realtime apps are required).
> 
> An alternative requires to make the current implementation that is
> lockless to use locks and introduce potential lock contention. This
> could be harmful to regular workloads. Not using pcp caching would make
> a fast path using few atomics rather than local pcp update. That is not
> a terrible cost to pay for special cased workloads which use isolcpus.
> Really we are not talking about a massive cost to be payed. At least
> nobody has shown that in any numbers.
> 
> > Concrete example: think of a computer handling vRAN traffic near a 
> > cell tower. The traffic (therefore amount of processing required
> > by realtime applications) might vary during the day.
> > 
> > User might want to run containers that depend on good memcg charging
> > performance on isolated CPUs.
> 
> What kind of role would memcg play here? Do you want to memory constrain
> that workload?

See example above.

Leonardo Bras Jan. 27, 2023, 5:40 a.m. UTC | #16
On Thu, 2023-01-26 at 15:19 -0300, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2023 at 03:14:48PM -0800, Roman Gushchin wrote:
> > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > [...]
> > > > > >
> > > > > > 0 - Motivation:
> > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > descendant of a given root_memcg.
> > 
> > [...]
> >
> > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > doing this without a really strong reason.
> 
> The expectation would be that cache locking should not cause slowdown of
> the allocation and free paths:
> 
> https://manualsbrain.com/en/manuals/1246877/?page=313
> 
> For the P6 and more recent processor families, if the area of memory being locked
> during a LOCK operation is cached in the processor that is performing the LOCK
> operation as write-back memory and is completely contained in a cache line, the
> processor may not assert the LOCK# signal on the bus. Instead, it will modify the
> memory location internally and allow its cache coherency mechanism to ensure that
> the operation is carried out atomically. This operation is called “cache locking.” The
> cache coherency mechanism automatically prevents two or more processors that ...
> 
> 

Just to keep the info easily available: the protected structure (struct
memcg_stock_pcp) fits in 48 bytes, which is less than the usual 64-byte cacheline.
struct memcg_stock_pcp {
	spinlock_t                 stock_lock;           /*     0     4 */
	unsigned int               nr_pages;             /*     4     4 */
	struct mem_cgroup *        cached;               /*     8     8 */
	struct obj_cgroup *        cached_objcg;         /*    16     8 */
	struct pglist_data *       cached_pgdat;         /*    24     8 */
	unsigned int               nr_bytes;             /*    32     4 */
	int                        nr_slab_reclaimable_b; /*    36     4 */
	int                        nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

(It got smaller after patches 3/5, 4/5 and 5/5, which remove holes, work_struct
and flags respectively.)

On top of that, patch 1/5 makes sure the percpu allocation is aligned to
cacheline size.
Leonardo Bras Jan. 27, 2023, 6:55 a.m. UTC | #17
On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > [...]
> > > > Remote draining reduces interruptions whether CPU 
> > > > is marked as isolated or not:
> > > > 
> > > > - Allows isolated CPUs from benefiting of pcp caching.
> > > > - Removes the interruption to non isolated CPUs. See for example 
> > > > 
> > > > https://lkml.org/lkml/2022/6/13/2769
> > > 
> > > This is talking about page allocato per cpu caches, right? In this patch
> > > we are talking about memcg pcp caches. Are you sure the same applies
> > > here?
> > 
> > Both can stall the users of the drain operation.
> 
> Yes. But it is important to consider who those users are. We are
> draining when
> 	- we are charging and the limit is hit so that memory reclaim
> 	  has to be triggered.
> 	- hard, high limits are set and require memory reclaim.
> 	- force_empty - full memory reclaim for a memcg
> 	- memcg offlining - cgroup removel - quite a heavy operation as
> 	  well.
> all those could be really costly kernel operations and they affect
> isolated cpu only if the same memcg is used by both isolated and non-isolated
> cpus. In other words those costly operations would have to be triggered
> from non-isolated cpus and those are to be expected to be stalled. It is
> the side effect of the local cpu draining that is scheduled that affects
> the isolated cpu as well.
> 
> Is that more clear?

I think so, please help me check:

IIUC, we can approach this by dividing the problem into two working modes:
1 - Normal, meaning no drain_all_stock() running.
2 - Draining, grouping together pre-OOM and userspace 'config' actions: changing,
destroying, or reconfiguring a memcg.

For (1), we will (ideally) have only the local cpu working on the percpu struct.
This mode will not have any kind of contention, because each CPU will hold its
own spinlock only.

For (2), we will have a lot of drain_all_stock() running. This will mean a lot
of schedule_work_on() calls (on upstream), or possible contention, i.e.
local cpus having to wait for a lock to access their cache, with the patch proposal.

Ok, given the above is correct:

# Some arguments point out that (1) becomes slower with this patch.

This is partially true: while test 2.2 showed that the local cpu functions'
running time became slower by a few cycles, test 2.4 shows that the userspace
perception was that syscalls and pagefaulting actually became faster:

During some debugging tests before getting the performance numbers of test 2.4, I
noticed that the 'syscall + write' test would call all those functions that
became slower in test 2.2. Those functions were called multiple millions of
times during a single test, and still the patched version's performance test
returned faster for test 2.4 than the upstream version. Maybe the functions became
slower, but the overall usage of them in the usual context became faster.

Is that not at least a small improvement?

# Regarding (2), I notice that we fear contention 

While this seems to be the harder part of the discussion, I think we have enough
data to deal with it. 

In which cases would contention be a big problem here?
IIUC it would be when a lot of drain_all_stock() calls are issued because the
memory limit is getting near. I mean, having the user create / modify a memcg
multiple times a second for a while is not something to be expected, IMHO.

Now, if I assumed correctly and the case where contention could be a problem is
on a memcg with high memory pressure, then we have the argument that Marcelo
Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
allocation brought better results than local_locks + schedule_work_on().

I mean, while contention would cause the cpu to wait for a while before getting
the lock for allocating a page from cache, something similar would happen with
schedule_work_on(), which would force the current task to wait while the
draining happens locally. 

What I am able to see is that, for each drain_all_stock(), for each cpu getting
drained we have the option to (a) (sometimes) wait for a lock to be freed, or
(b) wait for a whole context switch to happen.
And IIUC, (b) is much slower than (a) on average, and this is what causes the
improved performance seen in [P1].

(I mean, waiting while drain_local_stock() runs on the local CPU vs waiting for
it to run on the remote CPU may not be that different, since the cacheline is
already written to by the remote cpu on upstream.)

Also, according to test 2.2, for the patched version drain_local_stock() has
gotten faster (much faster for 128 cpus), even though it does all the draining
instead of just scheduling it on the other cpus.
I mean, adding that to the brief nature of the local cpu functions, we may not
hit contention as much as expected.

##

Sorry for the long text.
I may be missing some point, please let me know if that's the case.

Thanks a lot for reviewing!
Leo

[P1]: https://lkml.org/lkml/2022/6/13/2769
Michal Hocko Jan. 27, 2023, 6:58 a.m. UTC | #18
On Thu 26-01-23 21:32:18, Marcelo Tosatti wrote:
> On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> > On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > > Disclaimer:
> > > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > > >     sections to better organize myself. I am not very confortable with it.
> > > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > > >     from memcg_stock_pcp), which could further improve performance for
> > > > > > > > >     drain_all_stock(), but I could only notice the optimization at the
> > > > > > > > >     last minute.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 0 - Motivation:
> > > > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > > descendant of a given root_memcg.
> > > > > 
> > > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > > into why we have many of them and whether we really need them?
> > > > > 
> > > > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > > > on multiple cpus).
> > > > 
> > > > I believe I've never got a specific answer to that. We
> > > > have discussed that in the previous version submission
> > > > (20221102020243.522358-1-leobras@redhat.com and specifically
> > > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> > > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > > that just sounds weird but let's say this is the case indeed. 
> > > 
> > > This could be the case. You can consider an "edge device" where it is
> > > necessary to run a RT workload. It might also be useful to run 
> > > non realtime applications on the same system.
> > > 
> > > > Then an RT task or whatever task that is running on an isolated
> > > > cpu can have pcp charges.
> > > 
> > > Usually the RT task (or more specifically the realtime sensitive loop
> > > of the application) runs entirely on userspace. But i suppose there
> > > could be charges on application startup.
> > 
> > What is the role of memcg then? If the memory limit is in place and the
> > workload doesn't fit in then it will get reclaimed during start up and
> > memory would need to be refaulted if not mlocked. If it is mlocked then
> > the limit cannot be enforced and the start up would likely fail as a
> > result of the memcg oom killer.
> 
> 1) Application which is not time sensitive executes on isolated CPU,
> with memcg control enabled. Per-CPU stock is created.
> 
> 2) App with memcg control enabled exits, per-CPU stock is not drained.
> 
> 3) Latency sensitive application starts, isolated per-CPU has stock to
> be drained, and:
> 
> /*
>  * Drains all per-CPU charge caches for given root_memcg resp. subtree
>  * of the hierarchy under it.
>  */
> static void drain_all_stock(struct mem_cgroup *root_memcg)

No, this is not really answering my question. See
Y9LQ615H13RmG7wL@dhcp22.suse.cz, which already explains how the draining
would be triggered. The draining does not really happen on just any operation.

I am really asking for specific workloads which run multiple processes on a
mix of isolated and non-isolated cpus yet share a memcg, so that they can
interfere. The consequences of the common memcg are described above.
Michal Hocko Jan. 27, 2023, 7:11 a.m. UTC | #19
[Cc Frederic]

On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
[...]
> > > Essentially each cpu will try to grab the remains of the memory quota
> > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > on per-cgroup basis.
> > 
> > I think it would be more than sufficient to disable pcp charging on an
> > isolated cpu.
> 
> It might have significant performance consequences.

Is it really significant?

> I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> the accuracy of memory limits and slightly increase the memory footprint (all
> those dying memcgs...), but the impact will be limited. Actually it is limited
> by the number of cpus.

Hmm, OK, I had misunderstood your proposal. Yes, the overall pcp charges
potentially left behind should be small and that shouldn't really be a
concern for memcg oom situations (unless the limit is very small, and
workloads on isolated cpus using small hard limits is way beyond my
imagination).

My first thought was that those charges could be left behind without any
upper bound but in reality sooner or later something should be running
on those cpus and if the memcg is gone the pcp cache would get refilled
and old charges gone.

So yes, this is actually a better and even simpler solution. All we need
is something like this
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..13b84bbd70ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		struct mem_cgroup *memcg;
 		bool flush = false;
 
+		if (cpu_is_isolated(cpu))
+			continue;
+
 		rcu_read_lock();
 		memcg = stock->cached;
 		if (memcg && stock->nr_pages &&

There is no such cpu_is_isolated() AFAICS so we would need a help from
NOHZ and cpuisol people to create one for us. Frederic, would such an
abstraction make any sense from your POV?
Leonardo Bras Jan. 27, 2023, 7:14 a.m. UTC | #20
On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > Disclaimer:
> > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > >     sections to better organize myself. I am not very confortable with it.
> > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > >     from memcg_stock_pcp), which could further improve performance for
> > > > > > >     drain_all_stock(), but I could only notice the optimization at the
> > > > > > >     last minute.
> > > > > > > 
> > > > > > > 
> > > > > > > 0 - Motivation:
> > > > > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > descendant of a given root_memcg.
> > > 
> > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > into why we have many of them and whether we really need them?
> > > 
> > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > on multiple cpus).
> > 
> > I believe I've never got a specific answer to that. We
> > have discussed that in the previous version submission
> > (20221102020243.522358-1-leobras@redhat.com and specifically
> > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> > isolcpus. I was wondering about using memcgs in RT workloads because
> > that just sounds weird but let's say this is the case indeed. Then an RT
> > task or whatever task that is running on an isolated cpu can have pcp
> > charges.
> > 
> > > Essentially each cpu will try to grab the remains of the memory quota
> > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > on per-cgroup basis.
> > 
> > I think it would be more than sufficient to disable pcp charging on an
> > isolated cpu.
> 
> It might have significant performance consequences.
> 
> I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> the accuracy of memory limits and slightly increase the memory footprint (all
> those dying memcgs...), but the impact will be limited. Actually it is limited
> by the number of cpus.

I was discussing this same idea with Marcelo yesterday morning.

The questions we had on the topic were:
a - About how many pages the pcp cache will hold before draining them itself? 
b - Would it cache any kind of bigger page, or huge page in this same aspect?

Please let me know if I got anything wrong, but IIUC from a previous debugging
session, (a)'s answer is 4 pages. Meaning that even on bigger-page archs such as
powerpc, with 64k pages, the max pcp cache 'wasted' on each processor would be
256k (very small by today's standards).

Please let me know if you have any info on (b), or any correction on (a).

The thing is: having this drain_local_stock() waiver only for isolated cpus
would not bring to non-isolated cpus under high memory pressure the same
benefits that I understand this patchset brings.

OTOH, not running drain_local_stock() at all on any cpu may bring performance
gains (no remote CPU access), but it can be a problem if I got the
'wasted pages' count in (a) wrong. I mean, drain_local_stock() was introduced
for a reason.

> 
> > This is not a per memcg property.
> 
> Sure, my point was that in pre-oom condition several cpus might try to consolidate
> the remains of the memory quota, actually working against each other. Separate issue,
> which might be a reason why there are many flush attempts in the case we discuss.
> 
> Thanks!
> 

Thank you for reviewing!

Leo
Michal Hocko Jan. 27, 2023, 7:20 a.m. UTC | #21
On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
[...]
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
> 
> I was discussing this same idea with Marcelo yesterday morning.
> 
> The questions had in the topic were:
> a - About how many pages the pcp cache will hold before draining them itself? 

MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
doesn't really hold any pages. It is a mere counter of how many charges
have been accounted for the memcg page counter. So it is not really
consuming proportional amount of resources. It just pins the
corresponding memcg. Have a look at consume_stock and refill_stock

> b - Would it cache any kind of bigger page, or huge page in this same aspect?

The above should answer this as well as those following up I hope. If
not let me know.
Leonardo Bras Jan. 27, 2023, 7:22 a.m. UTC | #22
On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> [Cc Frederic]
> 
> On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> [...]
> > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > on per-cgroup basis.
> > > 
> > > I think it would be more than sufficient to disable pcp charging on an
> > > isolated cpu.
> > 
> > It might have significant performance consequences.
> 
> Is it really significant?
> 
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
> 
> Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> potentially left behind should be small and that shouldn't really be a
> concern for memcg oom situations (unless the limit is very small and
> workloads on isolated cpus using small hard limits is way beyond my
> imagination).
> 
> My first thought was that those charges could be left behind without any
> upper bound but in reality sooner or later something should be running
> on those cpus and if the memcg is gone the pcp cache would get refilled
> and old charges gone.
> 
> So yes, this is actually a better and even simpler solution. All we need
> is something like this
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..13b84bbd70ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		struct mem_cgroup *memcg;
>  		bool flush = false;
>  
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +
>  		rcu_read_lock();
>  		memcg = stock->cached;
>  		if (memcg && stock->nr_pages &&
> 
> There is no such cpu_is_isolated() AFAICS so we would need a help from
> NOHZ and cpuisol people to create one for us. Frederic, would such an
> abstraction make any sense from your POV?


IIUC, 'if (cpu_is_isolated())' would be instead:

if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
    !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ))
Leonardo Bras Jan. 27, 2023, 7:35 a.m. UTC | #23
On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> [...]
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> > 
> > I was discussing this same idea with Marcelo yesterday morning.
> > 
> > The questions had in the topic were:
> > a - About how many pages the pcp cache will hold before draining them itself? 
> 
> MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> doesn't really hold any pages. It is a mere counter of how many charges
> have been accounted for the memcg page counter. So it is not really
> consuming proportional amount of resources. It just pins the
> corresponding memcg. Have a look at consume_stock and refill_stock

I see. Thanks for pointing that out!

So in the worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
that are not getting used, and it may cause an 'earlier' OOM if this amount is
needed but can't be freed.

Taking the worst case further: supposing a big powerpc machine with 256 CPUs and
64k pages, each CPU holding 64 pages, that's 64k * 64 * 255 cpus => roughly 1GB
of memory minus 4MB (for the one cpu actually using its stock).
It's starting to get too big, but still ok for a machine this size.

The thing is that it can present odd behavior:
You have a cgroup created before, now empty, you try to run a given
application, and it hits OOM.
You then restart the cgroup and run the same application without an issue.

Even though this is just a possibility, it can be perceived by the user as
instability.

> 
> > b - Would it cache any kind of bigger page, or huge page in this same aspect?
> 
> The above should answer this as well as those following up I hope. If
> not let me know.

IIUC we are talking about normal pages, is that it?

Best regards,
Leo
Leonardo Bras Jan. 27, 2023, 8:12 a.m. UTC | #24
On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> > [Cc Frederic]
> > 
> > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > [...]
> > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > on per-cgroup basis.
> > > > 
> > > > I think it would be more than sufficient to disable pcp charging on an
> > > > isolated cpu.
> > > 
> > > It might have significant performance consequences.
> > 
> > Is it really significant?
> > 
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> > 
> > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > potentially left behind should be small and that shouldn't really be a
> > concern for memcg oom situations (unless the limit is very small and
> > workloads on isolated cpus using small hard limits is way beyond my
> > imagination).
> > 
> > My first thought was that those charges could be left behind without any
> > upper bound but in reality sooner or later something should be running
> > on those cpus and if the memcg is gone the pcp cache would get refilled
> > and old charges gone.
> > 
> > So yes, this is actually a better and even simpler solution. All we need
> > is something like this
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ab457f0394ab..13b84bbd70ba 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  		struct mem_cgroup *memcg;
> >  		bool flush = false;
> >  
> > +		if (cpu_is_isolated(cpu))
> > +			continue;
> > +
> >  		rcu_read_lock();
> >  		memcg = stock->cached;
> >  		if (memcg && stock->nr_pages &&
> > 
> > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > abstraction make any sense from your POV?
> 
> 
> IIUC, 'if (cpu_is_isolated())' would be instead:
> 
> if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)

oh, sorry 's/smp_processor_id()/cpu/' here:

if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))


Not sure why I added smp_processor_id() instead of cpu. I think I need some
sleep. :)
Michal Hocko Jan. 27, 2023, 9:23 a.m. UTC | #25
On Fri 27-01-23 05:12:13, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> > On Fri, 2023-01-27 at 08:11 +0100, Michal Hocko wrote:
> > > [Cc Frederic]
> > > 
> > > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > > on per-cgroup basis.
> > > > > 
> > > > > I think it would be more than sufficient to disable pcp charging on an
> > > > > isolated cpu.
> > > > 
> > > > It might have significant performance consequences.
> > > 
> > > Is it really significant?
> > > 
> > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > by the number of cpus.
> > > 
> > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > > potentially left behind should be small and that shouldn't really be a
> > > concern for memcg oom situations (unless the limit is very small and
> > > workloads on isolated cpus using small hard limits is way beyond my
> > > imagination).
> > > 
> > > My first thought was that those charges could be left behind without any
> > > upper bound but in reality sooner or later something should be running
> > > on those cpus and if the memcg is gone the pcp cache would get refilled
> > > and old charges gone.
> > > 
> > > So yes, this is actually a better and even simpler solution. All we need
> > > is something like this
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ab457f0394ab..13b84bbd70ba 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  		struct mem_cgroup *memcg;
> > >  		bool flush = false;
> > >  
> > > +		if (cpu_is_isolated(cpu))
> > > +			continue;
> > > +
> > >  		rcu_read_lock();
> > >  		memcg = stock->cached;
> > >  		if (memcg && stock->nr_pages &&
> > > 
> > > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > > abstraction make any sense from your POV?
> > 
> > 
> > IIUC, 'if (cpu_is_isolated())' would be instead:
> > 
> > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)
> 
> oh, sorry 's/smp_processor_id()/cpu/' here:
> 
> if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))

Or maybe we can get a nice abstract API so that we do not have to really
care about those low level details. I do not really know what those
really mean and hopefully I shouldn't really need to know.
Michal Hocko Jan. 27, 2023, 9:29 a.m. UTC | #26
On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > [...]
> > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > by the number of cpus.
> > > 
> > > I was discussing this same idea with Marcelo yesterday morning.
> > > 
> > > The questions had in the topic were:
> > > a - About how many pages the pcp cache will hold before draining them itself? 
> > 
> > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > doesn't really hold any pages. It is a mere counter of how many charges
> > have been accounted for the memcg page counter. So it is not really
> > consuming proportional amount of resources. It just pins the
> > corresponding memcg. Have a look at consume_stock and refill_stock
> 
> I see. Thanks for pointing that out!
> 
> So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)

s@numcpus@num_isolated_cpus@

> that are not getting used, and may cause an 'earlier' OOM if this amount is
> needed but can't be freed.

s@OOM@memcg OOM@
 
> In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each
> holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources).
> It's starting to get too big, but still ok for a machine this size.

It is more about the memcg limit than about the size of the machine.
Again, let's focus on the actual usecase. What is the usual memcg setup with
those isolcpus?

> The thing is that it can present an odd behavior: 
> You have a cgroup created before, now empty, and try to run given application,
> and hits OOM.

The application would either consume those cached charges or flush them
if it is running in a different memcg. Or what do you have in mind?

> You then restart the cgroup, run the same application without an issue.
> 
> Even though it looks a good possibility, this can be perceived by user as
> instability.
> 
> > 
> > > b - Would it cache any kind of bigger page, or huge page in this same aspect?
> > 
> > The above should answer this as well as those following up I hope. If
> > not let me know.
> 
> IIUC we are talking normal pages, is that it?

We are talking about memcg charges and those have page granularity.
Frederic Weisbecker Jan. 27, 2023, 1:03 p.m. UTC | #27
On Fri, Jan 27, 2023 at 05:12:13AM -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 04:22 -0300, Leonardo Brás wrote:
> > > Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> > > potentially left behind should be small and that shouldn't really be a
> > > concern for memcg oom situations (unless the limit is very small and
> > > workloads on isolated cpus using small hard limits is way beyond my
> > > imagination).
> > > 
> > > My first thought was that those charges could be left behind without any
> > > upper bound but in reality sooner or later something should be running
> > > on those cpus and if the memcg is gone the pcp cache would get refilled
> > > and old charges gone.
> > > 
> > > So yes, this is actually a better and even simpler solution. All we need
> > > is something like this
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ab457f0394ab..13b84bbd70ba 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  		struct mem_cgroup *memcg;
> > >  		bool flush = false;
> > >  
> > > +		if (cpu_is_isolated(cpu))
> > > +			continue;
> > > +
> > >  		rcu_read_lock();
> > >  		memcg = stock->cached;
> > >  		if (memcg && stock->nr_pages &&
> > > 
> > > There is no such cpu_is_isolated() AFAICS so we would need a help from
> > > NOHZ and cpuisol people to create one for us. Frederic, would such an
> > > abstraction make any sense from your POV?
> > 
> > 
> > IIUC, 'if (cpu_is_isolated())' would be instead:
> > 
> > if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN) ||
> > !housekeeping_cpu(smp_processor_id(), HK_TYPE_WQ)
> 
> oh, sorry 's/smp_processor_id()/cpu/' here:
> 
> if(!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) || !housekeeping_cpu(cpu, HK_TYPE_WQ))

Do you also need to handle cpuset.sched_load_balance=0 (aka. the cpuset v2 "isolated"
partition type)? It has the same effect as isolcpus=, but it can be changed at
runtime.

And then on_null_domain() looks like what you need. You'd have to make that API
more generally available though, and rename it to something like
"bool cpu_has_null_domain(int cpu)".

But then you also need to handle concurrent cpuset changes. If you can tolerate
it being racy, then RCU alone is fine.

Thanks.
Michal Hocko Jan. 27, 2023, 1:58 p.m. UTC | #28
On Fri 27-01-23 08:11:04, Michal Hocko wrote:
> [Cc Frederic]
> 
> On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> [...]
> > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > on per-cgroup basis.
> > > 
> > > I think it would be more than sufficient to disable pcp charging on an
> > > isolated cpu.
> > 
> > It might have significant performance consequences.
> 
> Is it really significant?
> 
> > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > the accuracy of memory limits and slightly increase the memory footprint (all
> > those dying memcgs...), but the impact will be limited. Actually it is limited
> > by the number of cpus.
> 
> Hmm, OK, I have misunderstood your proposal. Yes, the overal pcp charges
> potentially left behind should be small and that shouldn't really be a
> concern for memcg oom situations (unless the limit is very small and
> workloads on isolated cpus using small hard limits is way beyond my
> imagination).
> 
> My first thought was that those charges could be left behind without any
> upper bound but in reality sooner or later something should be running
> on those cpus and if the memcg is gone the pcp cache would get refilled
> and old charges gone.
> 
> So yes, this is actually a better and even simpler solution. All we need
> is something like this
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..13b84bbd70ba 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		struct mem_cgroup *memcg;
>  		bool flush = false;
>  
> +		if (cpu_is_isolated(cpu))
> +			continue;
> +
>  		rcu_read_lock();
>  		memcg = stock->cached;
>  		if (memcg && stock->nr_pages &&

Btw. this would be overly pessimistic. The following should make more
sense:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..55e440e54504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
Roman Gushchin Jan. 27, 2023, 6:18 p.m. UTC | #29
On Fri, Jan 27, 2023 at 02:58:19PM +0100, Michal Hocko wrote:
> On Fri 27-01-23 08:11:04, Michal Hocko wrote:
> > [Cc Frederic]
> > 
> > On Thu 26-01-23 15:12:35, Roman Gushchin wrote:
> > > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > [...]
> > > > > Essentially each cpu will try to grab the remains of the memory quota
> > > > > and move it locally. I wonder in such circumstances if we need to disable the pcp-caching
> > > > > on per-cgroup basis.
> > > > 
> > > > I think it would be more than sufficient to disable pcp charging on an
> > > > isolated cpu.
> > > 
> > > It might have significant performance consequences.
> > 
> > Is it really significant?
> > 
> > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > by the number of cpus.
> > 
> > Hmm, OK, I have misunderstood your proposal. Yes, the overall pcp charges
> > potentially left behind should be small and that shouldn't really be a
> > concern for memcg oom situations (unless the limit is very small and
> > workloads on isolated cpus using small hard limits is way beyond my
> > imagination).
> > 
> > My first thought was that those charges could be left behind without any
> > upper bound but in reality sooner or later something should be running
> > on those cpus and if the memcg is gone the pcp cache would get refilled
> > and old charges gone.
> > 
> > So yes, this is actually a better and even simpler solution. All we need
> > is something like this
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ab457f0394ab..13b84bbd70ba 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2344,6 +2344,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  		struct mem_cgroup *memcg;
> >  		bool flush = false;
> >  
> > +		if (cpu_is_isolated(cpu))
> > +			continue;
> > +
> >  		rcu_read_lock();
> >  		memcg = stock->cached;
> >  		if (memcg && stock->nr_pages &&
> 
> Btw. this would be over pessimistic. The following should make more
> sense:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..55e440e54504 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
>  			if (cpu == curcpu)
>  				drain_local_stock(&stock->work);
> -			else
> +			else if (!cpu_is_isolated(cpu))
>  				schedule_work_on(cpu, &stock->work);
>  		}
>  	}


Yes, this is exactly what I was thinking of. It should solve the problem
for isolated cpus well enough without introducing overhead for everybody else.

If you'll make a proper patch, please add my
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

I understand the concerns regarding spurious OOMs on a 256-core machine,
but I guess they are somewhat theoretical and also possible with the current code
(e.g. one ooming cgroup can effectively block draining for everybody else).

Thanks!
Leonardo Bras Jan. 27, 2023, 7:29 p.m. UTC | #30
On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote:
> On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > > [...]
> > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > > by the number of cpus.
> > > > 
> > > > I was discussing this same idea with Marcelo yesterday morning.
> > > > 
> > > > The questions had in the topic were:
> > > > a - About how many pages the pcp cache will hold before draining them itself? 
> > > 
> > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > > doesn't really hold any pages. It is a mere counter of how many charges
> > > have been accounted for the memcg page counter. So it is not really
> > > consuming proportional amount of resources. It just pins the
> > > corresponding memcg. Have a look at consume_stock and refill_stock
> > 
> > I see. Thanks for pointing that out!
> > 
> > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
> 
> s@numcpus@num_isolated_cpus@

I was thinking of the worst case scenario, with (ncpus - 1) cpus isolated.

> 
> > that are not getting used, and may cause an 'earlier' OOM if this amount is
> > needed but can't be freed.
> 
> s@OOM@memcg OOM@
 
> > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each
> > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources).
> > It's starting to get too big, but still ok for a machine this size.
> 
> It is more about the memcg limit than about the size of the machine.
> Again, let's focus on the actual use case. What is the usual memcg setup with
> those isolcpus?

I understand it's about the limit, not actually allocated memory. When I mention
the machine size, I mean what is expected to be acceptable for a user on that
machine.

> 
> > The thing is that it can present an odd behavior: 
> > You have a cgroup created before, now empty, and try to run given application,
> > and hits OOM.
> 
> The application would either consume those cached charges or flush them
> if it is running in a different memcg. Or what do you have in mind?

1 - Create a memcg with a VM inside, multiple vcpus pinned to isolated cpus. 
2 - Run multi-cpu task inside the VM, it allocates memory for every CPU and keep
    the pcp cache
3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all
    the available memory.
4 - memcg OOM.

Does it make sense?


> 
> > You then restart the cgroup, run the same application without an issue.
> > 
> > Even though it looks a good possibility, this can be perceived by user as
> > instability.
> > 
> > > 
> > > > b - Would it cache any kind of bigger page, or huge page in this same aspect?
> > > 
> > > The above should answer this as well as those following up I hope. If
> > > not let me know.
> > 
> > IIUC we are talking normal pages, is that it?
> 
> We are talking about memcg charges and those have page granularity.
> 

Thanks for the info!

Also, thanks for the feedback!
Leo
Roman Gushchin Jan. 27, 2023, 11:50 p.m. UTC | #31
On Fri, Jan 27, 2023 at 04:29:37PM -0300, Leonardo Brás wrote:
> On Fri, 2023-01-27 at 10:29 +0100, Michal Hocko wrote:
> > On Fri 27-01-23 04:35:22, Leonardo Brás wrote:
> > > On Fri, 2023-01-27 at 08:20 +0100, Michal Hocko wrote:
> > > > On Fri 27-01-23 04:14:19, Leonardo Brás wrote:
> > > > > On Thu, 2023-01-26 at 15:12 -0800, Roman Gushchin wrote:
> > > > [...]
> > > > > > I'd rather opt out of stock draining for isolated cpus: it might slightly reduce
> > > > > > the accuracy of memory limits and slightly increase the memory footprint (all
> > > > > > those dying memcgs...), but the impact will be limited. Actually it is limited
> > > > > > by the number of cpus.
> > > > > 
> > > > > I was discussing this same idea with Marcelo yesterday morning.
> > > > > 
> > > > > The questions had in the topic were:
> > > > > a - About how many pages the pcp cache will hold before draining them itself? 
> > > > 
> > > > MEMCG_CHARGE_BATCH (64 currently). And one more clarification. The cache
> > > > doesn't really hold any pages. It is a mere counter of how many charges
> > > > have been accounted for the memcg page counter. So it is not really
> > > > consuming proportional amount of resources. It just pins the
> > > > corresponding memcg. Have a look at consume_stock and refill_stock
> > > 
> > > I see. Thanks for pointing that out!
> > > 
> > > So in worst case scenario the memcg would have reserved 64 pages * (numcpus - 1)
> > 
> > s@numcpus@num_isolated_cpus@
> 
> I was thinking worst case scenario being (ncpus - 1) being isolated.
> 
> > 
> > > that are not getting used, and may cause an 'earlier' OOM if this amount is
> > > needed but can't be freed.
> > 
> > s@OOM@memcg OOM@
>  
> > > In the wave of worst case, supposing a big powerpc machine, 256 CPUs, each
> > > holding 64k * 64 pages => 1GB memory - 4MB (one cpu using resources).
> > > It's starting to get too big, but still ok for a machine this size.
> > 
> > It is more about the memcg limit than about the size of the machine.
> > Again, let's focus on the actual use case. What is the usual memcg setup with
> > those isolcpus?
> 
> I understand it's about the limit, not actually allocated memory. When I point
> the machine size, I mean what is expected to be acceptable from a user in that
> machine.
> 
> > 
> > > The thing is that it can present an odd behavior: 
> > > You have a cgroup created before, now empty, and try to run given application,
> > > and hits OOM.
> > 
> > The application would either consume those cached charges or flush them
> > if it is running in a different memcg. Or what do you have in mind?
> 
> 1 - Create a memcg with a VM inside, multiple vcpus pinned to isolated cpus. 
> 2 - Run multi-cpu task inside the VM, it allocates memory for every CPU and keep
>     the pcp cache
> 3 - Try to run a single-cpu task (pinned?) inside the VM, which uses almost all
>     the available memory.
> 4 - memcg OOM.
> 
> Does it make sense?

It can happen now as well, you just need a competing drain request.

Honestly, I feel the probability of this scenario being a real problem is fairly low.
I don't recall any complaints about spurious OOMs caused by races in the draining code.
Machines which are tight on memory rarely have so many idle cpus.

Thanks!
Marcelo Tosatti Jan. 31, 2023, 11:35 a.m. UTC | #32
On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote:
> On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > > [...]
> > > > > Remote draining reduces interruptions whether CPU 
> > > > > is marked as isolated or not:
> > > > > 
> > > > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > > - Removes the interruption to non isolated CPUs. See for example 
> > > > > 
> > > > > https://lkml.org/lkml/2022/6/13/2769
> > > > 
> > > > This is talking about page allocator per-cpu caches, right? In this patch
> > > > we are talking about memcg pcp caches. Are you sure the same applies
> > > > here?
> > > 
> > > Both can stall the users of the drain operation.
> > 
> > Yes. But it is important to consider who those users are. We are
> > draining when
> > 	- we are charging and the limit is hit so that memory reclaim
> > 	  has to be triggered.
> > 	- hard, high limits are set and require memory reclaim.
> > 	- force_empty - full memory reclaim for a memcg
> > 	- memcg offlining - cgroup removal - quite a heavy operation as
> > 	  well.
> > all those could be really costly kernel operations and they affect
> > isolated cpu only if the same memcg is used by both isolated and non-isolated
> > cpus. In other words those costly operations would have to be triggered
> > from non-isolated cpus and those are to be expected to be stalled. It is
> > the side effect of the local cpu draining that is scheduled that affects
> > the isolated cpu as well.
> > 
> > Is that more clear?
> 
> I think so, please help me check:
> 
> IIUC, we can approach this by dividing the problem in two working modes:
> 1 - Normal, meaning no drain_all_stock() running.
> 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> destroying, reconfiguring a memcg.
> 
> For (1), we will have (ideally) only local cpu working on the percpu struct.
> This mode will not have any kind of contention, because each CPU will hold its
> own spinlock only.
> 
> For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> local cpus having to wait for a lock to get their cache, on the patch proposal.
> 
> Ok, given the above is correct:
> 
> # Some arguments point that (1) becomes slower with this patch.
> 
> This is partially true: while test 2.2 pointed that local cpu functions running
> time had become slower by a few cycles, test 2.4 points that the userspace
> perception of it was that the syscalls and pagefaulting actually became faster:
> 
> During some debugging tests before getting the performance on test 2.4, I
> noticed that the 'syscall + write' test would call all those functions that
> became slower on test 2.2. Those functions were called multiple millions of
> times during a single test, and still the patched version performance test
> returned faster for test 2.4 than upstream version. Maybe the functions became
> slower, but overall the usage of them in the usual context became faster.
> 
> Is not that a small improvement?
> 
> # Regarding (2), I notice that we fear contention 
> 
> While this seems to be the harder part of the discussion, I think we have enough
> data to deal with it. 
> 
> In which case contention would be a big problem here? 
> IIUC it would be when a lot of drain_all_stock() get running because the memory
> limit is getting near. I mean, having the user to create / modify a memcg
> multiple times a second for a while is not something that is expected, IMHO.

Considering that the use of spinlocks with remote draining is the more general solution,
what would be a test-case to demonstrate a contention problem?

> Now, if I assumed correctly and the case where contention could be a problem is
> on a memcg with high memory pressure, then we have the argument that Marcelo
> Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
> allocation brought better results than local_locks + schedule_work_on().
> 
> I mean, while contention would cause the cpu to wait for a while before getting
> the lock for allocating a page from cache, something similar would happen with
> schedule_work_on(), which would force the current task to wait while the
> draining happens locally. 
> 
> What I am able to see is that, for each drain_all_stock(), for each cpu getting
> drained we have the option to (a) (sometimes) wait for a lock to be freed, or
> (b) wait for a whole context switch to happen.
> And IIUC, (b) is much slower than (a) on average, and this is what causes the
> improved performance seen in [P1].

Moreover, there is a delay for the remote CPU to execute a work item 
(there could be high priority tasks, IRQ handling on the remote CPU,
which delays execution of the work item even further).

Also, the other point is that the spinlock already exists for
PREEMPT_RT (which means that any potential contention issue 
with the spinlock today is limited to PREEMPT_RT users).

So it would be good to point out a specific problematic 
testcase/scenario with using the spinlock in this particular case?

> (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for
> it to run on the remote CPU may not be that different, since the cacheline is
> already written to by the remote cpu on upstream)
> 
> Also according to test 2.2, for the patched version, drain_local_stock() have
> gotten faster (much faster for 128 cpus), even though it does all the draining
> instead of just scheduling it on the other cpus. 
> I mean, summing that to the brief nature of local cpu functions, we may not hit
> contention as much as we are expected.
> 
> ##
> 
> Sorry for the long text.
> I may be missing some point, please let me know if that's the case.
> 
> Thanks a lot for reviewing!
> Leo
> 
> [P1]: https://lkml.org/lkml/2022/6/13/2769
> 
>
Leonardo Bras Feb. 1, 2023, 4:36 a.m. UTC | #33
On Tue, 2023-01-31 at 08:35 -0300, Marcelo Tosatti wrote:
> On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote:
> > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > > > [...]
> > > > > > Remote draining reduces interruptions whether CPU 
> > > > > > is marked as isolated or not:
> > > > > > 
> > > > > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > > > - Removes the interruption to non isolated CPUs. See for example 
> > > > > > 
> > > > > > https://lkml.org/lkml/2022/6/13/2769
> > > > > 
> > > > > This is talking about page allocator per-cpu caches, right? In this patch
> > > > > we are talking about memcg pcp caches. Are you sure the same applies
> > > > > here?
> > > > 
> > > > Both can stall the users of the drain operation.
> > > 
> > > Yes. But it is important to consider who those users are. We are
> > > draining when
> > > 	- we are charging and the limit is hit so that memory reclaim
> > > 	  has to be triggered.
> > > 	- hard, high limits are set and require memory reclaim.
> > > 	- force_empty - full memory reclaim for a memcg
> > > 	- memcg offlining - cgroup removal - quite a heavy operation as
> > > 	  well.
> > > all those could be really costly kernel operations and they affect
> > > isolated cpu only if the same memcg is used by both isolated and non-isolated
> > > cpus. In other words those costly operations would have to be triggered
> > > from non-isolated cpus and those are to be expected to be stalled. It is
> > > the side effect of the local cpu draining that is scheduled that affects
> > > the isolated cpu as well.
> > > 
> > > Is that more clear?
> > 
> > I think so, please help me check:

Michal, Roman: Could you please review my argumentation below, so I can
understand what exactly is wrong ?

> > 
> > IIUC, we can approach this by dividing the problem in two working modes:
> > 1 - Normal, meaning no drain_all_stock() running.
> > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> > destroying, reconfiguring a memcg.
> > 
> > For (1), we will have (ideally) only local cpu working on the percpu struct.
> > This mode will not have any kind of contention, because each CPU will hold it's
> > own spinlock only. 
> > 
> > For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> > of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> > local cpus having to wait for a lock to get their cache, on the patch proposal.
> > 
> > Ok, given the above is correct:
> > 
> > # Some arguments point that (1) becomes slower with this patch.
> > 
> > This is partially true: while test 2.2 pointed that local cpu functions running
> > time had become slower by a few cycles, test 2.4 points that the userspace
> > perception of it was that the syscalls and pagefaulting actually became faster:
> > 
> > During some debugging tests before getting the performance on test 2.4, I
> > noticed that the 'syscall + write' test would call all those functions that
> > became slower on test 2.2. Those functions were called multiple millions of
> > times during a single test, and still the patched version performance test
> > returned faster for test 2.4 than upstream version. Maybe the functions became
> > slower, but overall the usage of them in the usual context became faster.
> > 
> > Is not that a small improvement?
> > 
> > # Regarding (2), I notice that we fear contention 
> > 
> > While this seems to be the harder part of the discussion, I think we have enough
> > data to deal with it. 
> > 
> > In which case contention would be a big problem here? 
> > IIUC it would be when a lot of drain_all_stock() get running because the memory
> > limit is getting near. I mean, having the user to create / modify a memcg
> > multiple times a second for a while is not something that is expected, IMHO.
> 
> Considering that the use of spinlocks with remote draining is the more general solution,
> what would be a test-case to demonstrate a contention problem?

IIUC we could try to reproduce a memory-tight workload that keeps allocating /
freeing from different cpus (without hitting OOM).

Michal, Roman: Is that correct? You have any workload like that so we can test?

> 
> > Now, if I assumed correctly and the case where contention could be a problem is
> > on a memcg with high memory pressure, then we have the argument that Marcelo
> > Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
> > allocation brought better results than local_locks + schedule_work_on().
> > 
> > I mean, while contention would cause the cpu to wait for a while before getting
> > the lock for allocating a page from cache, something similar would happen with
> > schedule_work_on(), which would force the current task to wait while the
> > draining happens locally. 
> > 
> > What I am able to see is that, for each drain_all_stock(), for each cpu getting
> > drained we have the option to (a) (sometimes) wait for a lock to be freed, or
> > (b) wait for a whole context switch to happen.
> > And IIUC, (b) is much slower than (a) on average, and this is what causes the
> > improved performance seen in [P1].
> 
> Moreover, there is a delay for the remote CPU to execute a work item 
> (there could be high priority tasks, IRQ handling on the remote CPU,
> which delays execution of the work item even further).
> 
> Also, the other point is that the spinlock already exists for
> PREEMPT_RT (which means that any potential contention issue 
> with the spinlock today is limited to PREEMPT_RT users).
> 
> So it would be good to point out a specific problematic 
> testcase/scenario with using the spinlock in this particular case?


Oh, that's a good point, but IIUC spin_lock() replaces local_lock() in memcg,
meaning it will always run on the same CPU, and there should only be
contention if the memcg local-cpu functions get preempted by something that
also calls a memcg local-cpu function.

> 
> > (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for
> > it to run on the remote CPU may not be that different, since the cacheline is
> > already written to by the remote cpu on upstream)
> > 
> > Also according to test 2.2, for the patched version, drain_local_stock() have
> > gotten faster (much faster for 128 cpus), even though it does all the draining
> > instead of just scheduling it on the other cpus. 
> > I mean, summing that to the brief nature of local cpu functions, we may not hit
> > contention as much as we are expected.
> > 
> > ##
> > 
> > Sorry for the long text.
> > I may be missing some point, please let me know if that's the case.
> > 
> > Thanks a lot for reviewing!
> > Leo
> > 
> > [P1]: https://lkml.org/lkml/2022/6/13/2769
> > 
> > 
> 

Thanks!
Leo
Michal Hocko Feb. 1, 2023, 12:41 p.m. UTC | #34
On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
[...]
> So it would be good to point out a specific problematic 
> testcase/scenario with using the spinlock in this particular case?

Please think about it some more. The sole purpose of the pcp charge
caching is to avoid atomics because the normal fast path (i.e. no limit
hit) is a page counter which is an atomic counter. If you go with a spin
lock for the pcp cache you are just losing some of the advantage of the
caching. Sure that would be a smaller atomics use than a deeper memcg
hierarchy but still.

Not to mention a potential contention which is hard to predict and it
will depend on the specific runtime very much. So not something that
would be easy to test for other than artificial testcases. Full cpu
pcp draining is not a very common operation and one could argue that
the problem will be limited but so far I haven't really heard any strong
arguments against the proposal to avoid scheduling the work on isolated
cpus, which is a much simpler solution and achieves the same
effect.
Michal Hocko Feb. 1, 2023, 12:52 p.m. UTC | #35
On Wed 01-02-23 01:36:07, Leonardo Brás wrote:
[...]
> Michal, Roman: Could you please review my argumentation below, so I can
> understand what exactly is wrong ?
> 
> > > 
> > > IIUC, we can approach this by dividing the problem in two working modes:
> > > 1 - Normal, meaning no drain_all_stock() running.
> > > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> > > destroying, reconfiguring a memcg.
> > > 
> > > For (1), we will have (ideally) only local cpu working on the percpu struct.
> > > This mode will not have any kind of contention, because each CPU will hold it's
> > > own spinlock only. 

You are still hitting locked atomic operations which is not the case for
a simple pcp access. As already mentioned page counter (which would be
taken in case of no pcp cache) is also about atomics (albeit more of
them with a deeper cgroup hierarchy).

> > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> > > of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> > > local cpus having to wait for a lock to get their cache, on the patch proposal.

Yes.

> > > Ok, given the above is correct:
> > > 
> > > # Some arguments point that (1) becomes slower with this patch.
> > > 
> > > This is partially true: while test 2.2 pointed that local cpu functions running
> > > time had became slower by a few cycles, test 2.4 points that the userspace
> > > perception of it was that the syscalls and pagefaulting actually became faster:
> > > 
> > > During some debugging tests before getting the performance on test 2.4, I
> > > noticed that the 'syscall + write' test would call all those functions that
> > > became slower on test 2.2. Those functions were called multiple millions of
> > > times during a single test, and still the patched version performance test
> > > returned faster for test 2.4 than upstream version. Maybe the functions became
> > > slower, but overall the usage of them in the usual context became faster.

It is hard to tell anything without proper analysis of those numbers.

> > > Is not that a small improvement?
> > > 
> > > # Regarding (2), I notice that we fear contention 
> > > 
> > > While this seems to be the harder part of the discussion, I think we have enough
> > > data to deal with it. 
> > > 
> > > In which case contention would be a big problem here? 
> > > IIUC it would be when a lot of drain_all_stock() get running because the memory
> > > limit is getting near. I mean, having the user to create / modify a memcg
> > > multiple times a second for a while is not something that is expected, IMHO.

We have already explained when that happens. You are right this is not
happening all that much in most workloads. But this is a user-triggered
operation so you cannot really make many assumptions. It will really depend
on the workload.

> > Considering that the use of spinlocks with remote draining is the more general solution,
> > what would be a test-case to demonstrate a contention problem?
> 
> IIUC we could try to reproduce a memory tight workload that keeps allocating /
> freeing from different cpus (without hitting OOM).
> 
> Michal, Roman: Is that correct? You have any workload like that so we can test?

It would be easy to create artificial workload that would hit this. All
you need is to trigger any of the code paths that have already been
mentioned elsewhere from the userspace.

Let me be clear here. Unless there are some real concerns about not
flushing remotely on isolated cpus then the spin lock approach is just
much less preferable. So I would much rather focus on the trivial patch
and investigate whether there are any new corner cases created by
that.
Roman Gushchin Feb. 1, 2023, 6:31 p.m. UTC | #36
On Thu, Jan 26, 2023 at 08:20:46PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:03:43, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:41:34AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:14:48, Roman Gushchin wrote:
> > > > On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > > > > > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > > > > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > > > > > Disclaimer:
> > > > > > > > a - The cover letter got bigger than expected, so I had to split it in
> > > > > > > >     sections to better organize myself. I am not very comfortable with it.
> > > > > > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > > > > > >     from memcg_stock_pcp), which could further improve performance for
> > > > > > > >     drain_all_stock(), but I could only notice the optimization at the
> > > > > > > >     last minute.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 0 - Motivation:
> > > > > > > > On the current codebase, when drain_all_stock() is run, it will schedule a
> > > > > > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > > > > > descendant of a given root_memcg.
> > > > 
> > > > Do you know what caused those drain_all_stock() calls? I wonder if we should look
> > > > into why we have many of them and whether we really need them?
> > > > 
> > > > It's either some user's actions (e.g. reducing memory.max), either some memcg
> > > > is entering pre-oom conditions. In the latter case a lot of drain calls can be
> > > > scheduled without a good reason (assuming the cgroup contain multiple tasks running
> > > > on multiple cpus).
> > > 
> > > I believe I've never got a specific answer to that. We
> > > have discussed that in the previous version submission
> > > (20221102020243.522358-1-leobras@redhat.com and specifically
> > > Y2TQLavnLVd4qHMT@dhcp22.suse.cz). Leonardo has mentioned a mix of RT and
> > > isolcpus. I was wondering about using memcgs in RT workloads because
> > > that just sounds weird but let's say this is the case indeed. 
> > 
> > This could be the case. You can consider an "edge device" where it is
> > necessary to run a RT workload. It might also be useful to run 
> > non realtime applications on the same system.
> > 
> > > Then an RT task or whatever task that is running on an isolated
> > > cpu can have pcp charges.
> > 
> > Usually the RT task (or more specifically the realtime sensitive loop
> > of the application) runs entirely on userspace. But i suppose there
> > could be charges on application startup.
> 
> What is the role of memcg then? If the memory limit is in place and the
> workload doesn't fit in then it will get reclaimed during start up and
> memory would need to be refaulted if not mlocked. If it is mlocked then
> the limit cannot be enforced and the start up would likely fail as a
> result of the memcg oom killer.
> 
> [...]
> > > > Overall I'm somewhat resistant to an idea of making generic allocation & free paths slower
> > > > for an improvement of stock draining. It's not a strong objection, but IMO we should avoid
> > > > doing this without a really strong reason.
> > > 
> > > Are you OK with a simple opt out on isolated CPUs? That would make
> > > charges slightly slower (atomic on the hierarchy counters vs. a single
> > > pcp adjustment) but it would guarantee that the isolated workload is
> > > predictable which is the primary objective AFAICS.
> > 
> > This would make isolated CPUs "second class citizens": it would be nice
> > to be able to execute non realtime apps on isolated CPUs as well
> > (think of different periods of time during a day, one where 
> > more realtime apps are required, another where less 
> > realtime apps are required).
> 
> An alternative requires to make the current implementation that is
> lockless to use locks and introduce potential lock contention. This
> could be harmful to regular workloads. Not using pcp caching would make
> a fast path using few atomics rather than local pcp update. That is not
> a terrible cost to pay for special cased workloads which use isolcpus.
> Really we are not talking about a massive cost to be payed. At least
> nobody has shown that in any numbers.

Can't agree more.
I also agree that the whole pcpu stock draining code can be enhanced,
but I believe we should go in a direction almost directly opposite
to what's being proposed here.

Can we please return to the original problem which the patchset aims to solve?
Is it the latency introduced by execution of draining works on isolated cpus?
Maybe schedule these works with a delay and cancel them if the draining
has happened naturally during the delay?

Thanks!
Michal Hocko Feb. 3, 2023, 3:21 p.m. UTC | #37
OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
have talked to Frederic off-list and he is working on an implementation.
I will be offline next whole week (will be back Feb 13th) so I am
sending this early, but this patch cannot be merged without his work, of
course.

I have tried to summarize the reasoning behind both approaches should we
ever need to revisit this decision. For now I strongly believe a simpler
solution should be preferred.

Roman, I have added your ack as you indicated but if you disagree with
the reasoning please let me know.
--- 
From 6d99c4d7a238809ff2eb7c362b45d002ca212c70 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 3 Feb 2023 15:54:38 +0100
Subject: [PATCH] memcg: do not drain charge pcp caches on remote isolated cpus

Leonardo Bras has noticed that pcp charge cache draining might be
disruptive on workloads relying on 'isolated cpus', a feature commonly
used on workloads that are sensitive to interruption and context
switching such as vRAN and Industrial Control Systems.

There are essentially two ways to approach the issue. We can either
allow the pcp cache to be drained on a different cpu rather than the
local one, or avoid remote flushing on isolated cpus.

The current pcp charge cache is really optimized for high performance
and it relies on always sticking with its cpu. That means it only requires
a local_lock (preempt_disable on !RT) and draining is handed over to a
pcp WQ item so it again runs locally.

The former solution (remote draining) would require adding locking to
prevent local charges from racing with the draining. This adds an
atomic operation to the otherwise simple arithmetic fast path in
try_charge. Another concern is that the remote draining can cause lock
contention for the isolated workloads and therefore interfere with
them indirectly via user space interfaces.

Another option is to avoid scheduling the draining on isolated cpus
altogether. That means that those remote cpus would keep their charges
even after drain_all_stock returns. This is certainly not optimal either
but it shouldn't really cause any major problems. In the worst case
(many isolated cpus with charges - each of them holding up to
MEMCG_CHARGE_BATCH, i.e. 64 pages) the memory consumption of a memcg
would be artificially higher than what can be immediately used from
other cpus.

Theoretically a memcg OOM killer could be triggered prematurely.
Currently it is not really clear whether this is a practical problem
though. A tight memcg limit would be really counterproductive for cpu
isolated workloads pretty much by definition because any memory
reclaim induced by the memcg limit could break user space timing
expectations, as those workloads usually expect to execute in user
space most of the time.

Also charges could be left behind on memcg removal. Any future charge on
those isolated cpus will drain that pcp cache so this won't be a
permanent leak.

Considering the pros and cons of both approaches, this patch implements
the second option and simply does not schedule remote draining if the
target cpu is isolated. This solution is much simpler. It doesn't
add any new locking and it is more predictable from the user space
POV. Should the premature memcg OOM become a real life problem, we can
revisit this decision.

Reported-by: Leonardo Bras <leobras@redhat.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..55e440e54504 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2357,7 +2357,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
Roman Gushchin Feb. 3, 2023, 7:25 p.m. UTC | #38
On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote:
> OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
> have talked to Frederic off-list and he is working on an implementation.
> I will be offline next whole week (will be back Feb 13th) so I am
> sending this early, but this patch cannot be merged without his work, of
> course.
> 
> I have tried to summarize the reasoning behind both approaches should we
> ever need to revisit this decision. For now I strongly believe a simpler
> solution should be preferred.
> 
> Roman, I have added your ack as you indicated but if you disagree with
> the reasoning please let me know.

Great, thank you for preparing it! Really nice summary.
My ack definitely applies.

If you want, feel free to add a
"Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>"
tag to blame me later :)

Thank you!
Leonardo Bras Feb. 4, 2023, 4:55 a.m. UTC | #39
Michal, Roman,

I understand you have far more experience in this codebase than I do, so
please help me understand what I am missing in my argument for the spinlock
approach.

I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic 
> > testcase/scenario with using the spinlock in this particular case?
> 
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I have read the code that calls consume_stock(), and noticed that you are
probably referring to the atomic operations on memcg->memory->usage (and others)
used in page_counter_try_charge(). It is a single atomic variable per memcg that
is potentially accessed by every cpu. Is that correct?

I understand the cost of an atomic operation such as this is high because of the
inherent high cost of bouncing the variable's cacheline between those cpus.

The difference between the 'usage' atomic variable and the spinlock we are proposing
is the scope of the variable: the spinlock is defined in a per-cpu structure, and
most of the accesses will come from the local cpu.

According to Intel® 64 and IA-32 Architectures Software Developer’s Manual, at
Vol. 3A page 9-5:

###
9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is performing the
LOCK operation as write-back memory and is completely contained in a cache line,
the processor may not assert the LOCK# signal on the bus. Instead, it will
modify the memory location internally and allow it’s cache coherency mechanism
to ensure that the operation is carried out atomically. This operation is called
“cache locking.” The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from simultaneously
modifying data in that area.
###

So locking a percpu spinlock which is cached in the current cpu (which happens
most of the time) is as cheap as modifying the local cacheline. Since the
cachelines are already being modified by the local cpu functions, the write to
memory is batched.

For reference, this is the pahole output for memcg_stock_pcp after my patch. The
spinlock fits in the same cacheline as the changed variables.

struct memcg_stock_pcp {
	spinlock_t                 stock_lock;           /*     0     4 */
	unsigned int               nr_pages;             /*     4     4 */
	struct mem_cgroup *        cached;               /*     8     8 */
	struct obj_cgroup *        cached_objcg;         /*    16     8 */
	struct pglist_data *       cached_pgdat;         /*    24     8 */
	unsigned int               nr_bytes;             /*    32     4 */
	int                        nr_slab_reclaimable_b; /*    36     4 */
	int                        nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

The only accesses that will come from a remote cpu, and thus cause the cacheline
to bounce and the lock to be more expensive, are the ones from
drain_all_stock(). OTOH, on upstream, those accesses already modify the remote
cacheline with an atomic variable (memcg_stock_pcp.flags), so we are only
dealing with potential contention costs.

> 
> Not to mention a potential contention which is hard to predict and it
> will depend on the specific runtime very much. So not something that
> would be easy to test for other than artificial testcases. Full cpu
> pcp draining is not a very common operation and one could argue that
> the problem will be limited 
> 

Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
contention, for a "sometimes we hit spinlock contention".

For the spinlock proposal, on the local cpu side, the *worst case* contention
is:
1 - wait the spin_unlock() for a complete <percpu cache drain process>,
2 - wait a cache hit for local per-cpu cacheline 

With what is currently implemented (the schedule_work_on() approach), for the local
cpu side there is *always* this contention:
1 - wait for a context switch,
2 - wait a cache hit from its local per-cpu cacheline,
3 - wait a complete <percpu cache drain process>, 
4 - then for a new context switch to the current thread.

So moving from schedule_work_on() to spinlocks will save 2 context switches per
cpu every time drain_all_stock() is called.

On the remote cpu side, my tests point that doing the remote draining is faster
than scheduling a local draining, so it's also a gain.

Also, IIUC the possible contention in the spinlock approach happens only on
page-faulting and syscalls, versus the schedule_work_on() approach that can
interrupt user workload at any time. 

In fact, not interrupting the user workload in isolated cpus is just a bonus of
using spinlocks.

> but so far I haven't really heard any strong
> arguments against the proposal to avoid scheduling the work on isolated
> cpus which is a much more simpler solution and achieves the same
> effect.

I understand the 'not scheduling work on isolated cpus' appeal: it's a much
simpler change, and will only affect cases with isolated cpus, so it is safer
than changing the whole locking structure. 

I am not against it. I already have a patch implementing it for testing, and I
could gladly send it if you see fit.

But if nothing else, it introduces another special case, and now the code has to
deal differently with the local cpu, remote cpus, and isolated cpus.

With spinlocks, there is a much simpler solution:
- All cpus are dealt with the same way, which is faster in the local cpu (as 
  in upstream implementation).
- We don't have to avoid draining on isolated cpus.
- We get rid of the "work_struct", and scheduling costs
- We get rid of "flag", as we don't need to take care of multi-scheduling local 
  draining.
- When drain_all_stock() returns, all stock have already been drained, so 
  retrying consume_stock() may have immediate results on pre-OOM cases.

On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote:
[...]
> Let me be clear here. Unless there are some real concerns about not
> flushing remotely on isolated cpus then the spin lock approach is just
> much less preferable. So I would much rather focus on the trivial patch
> and investigates whether there are no new corner cases to be created by
> that.

I understand 'not scheduling work on isolated cpus' approach should work with
low impact on current behavior, and I also understand that as Maintainers you
have to consider many more factors than a regular developer like me, and also
that the spinlock approach is a big change to how pcp memcg caches work.

On that topic, if there is any comparison test you find important running, or
any topic you think could be better discussed, please let me know.

Thank you for reviewing,
Leo
Roman Gushchin Feb. 5, 2023, 7:49 p.m. UTC | #40
Hi Leonardo!

> Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> contention, for a "sometimes we hit spinlock contention".
> 
> For the spinlock proposal, on the local cpu side, the *worst case* contention
> is:
> 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> 2 - wait a cache hit for local per-cpu cacheline 
> 
> With what is currently implemented (the schedule_work_on() approach), for the local
> cpu side there is *always* this contention:
> 1 - wait for a context switch,
> 2 - wait a cache hit from its local per-cpu cacheline,
> 3 - wait a complete <percpu cache drain process>, 
> 4 - then for a new context switch to the current thread.

I think both Michal and me are thinking of a more generic case in which the cpu
is not exclusively consumed by 1 special process, so that the draining work can
be executed during an idle time. In this case the work is basically free.

And the introduction of a spin_lock() on the hot path is what we are concerned
about. I agree that on some hardware platforms it won't be that expensive, but
in general not having any spinlocks is so much better.

> 
> So moving from schedule_work_on() to spinlocks will save 2 context switches per
> cpu every time drain_all_stock() is called.
> 
> On the remote cpu side, my tests point that doing the remote draining is faster
> than scheduling a local draining, so it's also a gain.
> 
> Also, IIUC the possible contention in the spinlock approach happens only on
> page-faulting and syscalls, versus the schedule_work_on() approach that can
> interrupt user workload at any time. 
> 
> In fact, not interrupting the user workload in isolated cpus is just a bonus of
> using spinlocks.

I believe it significantly depends on the preemption model: you're right regarding
fully preemptive kernels, but with voluntary/none preemption it's exactly opposite:
the draining work will be executed at some point later (probably with 0 cost),
while the remote access from another cpu will potentially cause delays on the
spin lock as well as a need to refill the stock.

Overall I'd expect a noticeable performance regression from an introduction of
spin locks and remote draining. Maybe not on all platforms, but at least on some.
That's my main concern. And I don't think the problem we're aiming to solve here
justifies this potential regression.

Thanks!
Leonardo Bras Feb. 7, 2023, 3:18 a.m. UTC | #41
On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> Hi Leonardo!

Hello Roman,
Thanks a lot for replying!

> 
> > Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
> > contention, for a "sometimes we hit spinlock contention".
> > 
> > For the spinlock proposal, on the local cpu side, the *worst case* contention
> > is:
> > 1 - wait the spin_unlock() for a complete <percpu cache drain process>,
> > 2 - wait a cache hit for local per-cpu cacheline 
> > 
> > With what is currently implemented (the schedule_work_on() approach), for the local
> > cpu side there is *always* this contention:
> > 1 - wait for a context switch,
> > 2 - wait a cache hit from its local per-cpu cacheline,
> > 3 - wait a complete <percpu cache drain process>, 
> > 4 - then for a new context switch to the current thread.
> 
> I think both Michal and me are thinking of a more generic case in which the cpu
> is not exclusively consumed by 1 special process, so that the draining work can
> be executed during an idle time. In this case the work is basically free.

Oh, it makes sense.
But in such scenarios, wouldn't the same happen with spinlocks?

I mean, most of the contention with spinlocks only happens if the remote cpu is
trying to drain the cache while the local cpu happens to be draining/charging,
which is quite rare due to how fast the local cpu operations are.

Also, if the cpu has some idle time, using a little more on a possible spinlock
contention should not be a problem. Right?

> 
> And the introduction of a spin_lock() on the hot path is what we are concerned
> about. I agree that on some hardware platforms it won't be that expensive,
> 

IIRC most multicore hardware platforms supported by the kernel should have
the same behavior, since it's better to rely on cache coherence than on locking
the memory bus.

For instance, the other popular architectures supported by Linux use the LR/SC
strategy for atomic operations (tested on ARM, POWER, RISC-V) and IIRC the
Load-Reserve slow path waits for cacheline exclusivity, which is already
held in this per-cpu structure.


> but in general not having any spinlocks is so much better.

I agree that spinlocks may bring contention, which is not ideal in many cases.
In this case, though, it may not be a big issue, due to the very rare remote
accesses to the structure in the usual (non-pre-OOM memcg) case.

> 
> > 
> > So moving from schedule_work_on() to spinlocks will save 2 context switches per
> > cpu every time drain_all_stock() is called.
> > 
> > On the remote cpu side, my tests point that doing the remote draining is faster
> > than scheduling a local draining, so it's also a gain.
> > 
> > Also, IIUC the possible contention in the spinlock approach happens only on
> > page-faulting and syscalls, versus the schedule_work_on() approach that can
> > interrupt user workload at any time. 
> > 
> > In fact, not interrupting the user workload in isolated cpus is just a bonus of
> > using spinlocks.
> 
> I believe it significantly depends on the preemption model: you're right regarding
> fully preemptive kernels, but with voluntary/none preemption it's exactly opposite:
> the draining work will be executed at some point later (probably with 0 cost),

So, in case of voluntary/none preemption with some free cpu time. 

> while the remote access from another cpu will potentially cause delays on the
> spin lock as well as a need to refill the stock.

But if there is some free CPU time, what is the issue of some (potential) delays
due to spinlock contention?

I am probably missing the whole picture, but when I think of performance
improvement, I think of doing more with the same cputime. If we can use free
cputime to do stuff later, it's only fair to also use it in case of contention,
right?

I know there are some cases that need to be more predictable (mostly RT), but
when I think of memory allocation, I don't expect it to always take the same
time (as there are caches, pre-OOM situations, and so on).

Also, as previously discussed, in case of a busy cpu, the spinlock approach will
probably allow more work to be done.

> 
> Overall I'd expect a noticeable performance regression from an introduction of
> spin locks and remote draining. Maybe not on all platforms, but at least on some.
> That's my main concern.
> 

I see.
For the platform I have tested (x86) I noticed better overall performance with
spinlocks than with the upstream solution. For other popular platforms, I have
briefly read some documentation on locking/atomicity and I think we may keep
the performance gains.

But to be sure, I could rerun the tests on other platforms, such as ARM, POWER
and RISC-V, or run any extra tests you suggest.

With that info, would you feel less concerned about a possible change in memcg
pcp cache locking scheme?


>  And I don't think the problem we're aiming to solve here
> justifies this potential regression.
> 

Strictly speaking, the isolated cpu scheduling problem is already fixed by the
housekeeping patch (with some limitations).

At this point, I am trying to bring focus to a (possible) performance
improvement on the memcg pcp cache locking system.


> Thanks!
> 

Thank you for helping me better understand your arguments and concerns.
I really appreciate it!

Best regards,
Leo
Michal Hocko Feb. 13, 2023, 1:36 p.m. UTC | #42
On Fri 03-02-23 11:25:16, Roman Gushchin wrote:
> On Fri, Feb 03, 2023 at 04:21:09PM +0100, Michal Hocko wrote:
> > OK, so this is a speculative patch. cpu_is_isolated doesn't exist yet. I
> > have talked to Frederic off-list and he is working on an implementation.
> > I will be offline next whole week (will be back Feb 13th) so I am
> > sending this early, but this patch cannot be merged without his work, of
> > course.
> > 
> > I have tried to summarize the reasoning behind both approaches should we
> > ever need to revisit this decision. For now I strongly believe a simpler
> > solution should be preferred.
> > 
> > Roman, I have added your ack as you indicated but if you disagree with
> > the reasoning please let me know.
> 
> Great, thank you for preparing it! Really nice summary.
> My ack definitely applies.
> 
> If you want, feel free to add a
> "Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>"
> tag to blame me later :)

Sure, I have updated the changelog. I will repost after the patchset by
Frederic[1] is merged.

[1] http://lkml.kernel.org/r/20230203232409.163847-1-frederic@kernel.org