diff mbox series

mm: Do not deactivate when the cgroup has plenty inactive page

Message ID 1601034552-95831-1-git-send-email-zhongjiang-ali@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series mm: Do not deactivate when the cgroup has plenty inactive page | expand

Commit Message

zhong jiang Sept. 25, 2020, 11:49 a.m. UTC
After appling the series patches(mm: fix page aging across multiple cgroups),
cgroup memory reclaim strategy is based on reclaim root's inactive:active
ratio. if the target lruvec need to deactivate, its children cgroup also will
deactivate. That will result in hot page to be reclaimed and other cgroup's
cold page will be left, which is not expected.

The patch will not force deactivate when inactive_is_low is not true unless
we has scanned the inactive list and memory is unable to reclaim.

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
---
 mm/vmscan.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Michal Hocko Sept. 25, 2020, 12:07 p.m. UTC | #1
On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
> After appling the series patches(mm: fix page aging across multiple cgroups),
> cgroup memory reclaim strategy is based on reclaim root's inactive:active
> ratio. if the target lruvec need to deactivate, its children cgroup also will
> deactivate. That will result in hot page to be reclaimed and other cgroup's
> cold page will be left, which is not expected.
> 
> The patch will not force deactivate when inactive_is_low is not true unless
> we has scanned the inactive list and memory is unable to reclaim.

Do you have any data to present?

> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
> ---
>  mm/vmscan.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 466fc31..77d395f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2407,8 +2407,21 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		case SCAN_FILE:
>  		case SCAN_ANON:
>  			/* Scan one type exclusively */
> -			if ((scan_balance == SCAN_FILE) != file)
> +			if ((scan_balance == SCAN_FILE) != file) {
>  				scan = 0;
> +			} else {
> +				/*
> +				 * Reclaim memory is based on the root's inactive: active
> +				 * ratio, but it is possible that silbing cgroup has a lot
> +				 * of cold memory to reclaim rather than reclaim the hot
> +				 * cache in the current cgroup.
> +				 */
> +				if (!sc->force_deactivate && is_active_lru(lru) &&
> +						!inactive_is_low(lruvec, lru - LRU_ACTIVE)) {
> +					if (sc->may_deactivate & (1 << file))
> +						scan = 0;
> +				}
> +			}
>  			break;
>  		default:
>  			/* Look ma, no brain */
> -- 
> 1.8.3.1
>
zhong jiang Sept. 27, 2020, 2:39 a.m. UTC | #2
On 2020/9/25 8:07 下午, Michal Hocko wrote:
> On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
>> After appling the series patches(mm: fix page aging across multiple cgroups),
>> cgroup memory reclaim strategy is based on reclaim root's inactive:active
>> ratio. if the target lruvec need to deactivate, its children cgroup also will
>> deactivate. That will result in hot page to be reclaimed and other cgroup's
>> cold page will be left, which is not expected.
>>
>> The patch will not force deactivate when inactive_is_low is not true unless
>> we has scanned the inactive list and memory is unable to reclaim.
> Do you have any data to present?

I write an testcase that cgroup B has a lot of hot pagecache and cgroup 
C  is full of cold pagecache.  and

their parent cgroup A will trigger the reclaim due of it's limit has 
been breached.


The testcase should assume that we should not reclaim the  hot pagecache 
in cgroup B because C has

plenty cold pagecache.   Unfortunately,  I can see cgroup B hot 
pagecache has been decreased when

cgroup A trigger the reclaim.


Thanks,

>
>> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 466fc31..77d395f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2407,8 +2407,21 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>>   		case SCAN_FILE:
>>   		case SCAN_ANON:
>>   			/* Scan one type exclusively */
>> -			if ((scan_balance == SCAN_FILE) != file)
>> +			if ((scan_balance == SCAN_FILE) != file) {
>>   				scan = 0;
>> +			} else {
>> +				/*
>> +				 * Reclaim memory is based on the root's inactive: active
>> +				 * ratio, but it is possible that silbing cgroup has a lot
>> +				 * of cold memory to reclaim rather than reclaim the hot
>> +				 * cache in the current cgroup.
>> +				 */
>> +				if (!sc->force_deactivate && is_active_lru(lru) &&
>> +						!inactive_is_low(lruvec, lru - LRU_ACTIVE)) {
>> +					if (sc->may_deactivate & (1 << file))
>> +						scan = 0;
>> +				}
>> +			}
>>   			break;
>>   		default:
>>   			/* Look ma, no brain */
>> -- 
>> 1.8.3.1
>>
Michal Hocko Sept. 29, 2020, 4:02 p.m. UTC | #3
On Sun 27-09-20 10:39:22, zhong jiang wrote:
> 
> On 2020/9/25 8:07 下午, Michal Hocko wrote:
> > On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
> > > After appling the series patches(mm: fix page aging across multiple cgroups),
> > > cgroup memory reclaim strategy is based on reclaim root's inactive:active
> > > ratio. if the target lruvec need to deactivate, its children cgroup also will
> > > deactivate. That will result in hot page to be reclaimed and other cgroup's
> > > cold page will be left, which is not expected.
> > > 
> > > The patch will not force deactivate when inactive_is_low is not true unless
> > > we has scanned the inactive list and memory is unable to reclaim.
> > Do you have any data to present?
> 
> I write an testcase that cgroup B has a lot of hot pagecache and cgroup C 
> is full of cold pagecache.  and
> 
> their parent cgroup A will trigger the reclaim due of it's limit has been
> breached.
> 
> 
> The testcase should assume that we should not reclaim the  hot pagecache in
> cgroup B because C has
> 
> plenty cold pagecache.   Unfortunately,  I can see cgroup B hot pagecache
> has been decreased when
> 
> cgroup A trigger the reclaim.

Thank you, this is more or less what've expected from your initial
description. An extended form would be preferable for the changelog to
make the setup more clear. But you still haven't quantified the effect.
It would be also good to describe what is the effect on the workload
described by 53138cea7f39 ("mm: vmscan: move file exhaustion detection
to the node level"). A more extended description on the implementation
would be nice as well.
zhong jiang Oct. 14, 2020, 2:21 p.m. UTC | #4
On 2020/9/30 12:02 上午, Michal Hocko wrote:
> On Sun 27-09-20 10:39:22, zhong jiang wrote:
>> On 2020/9/25 8:07 下午, Michal Hocko wrote:
>>> On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
>>>> After appling the series patches(mm: fix page aging across multiple cgroups),
>>>> cgroup memory reclaim strategy is based on reclaim root's inactive:active
>>>> ratio. if the target lruvec need to deactivate, its children cgroup also will
>>>> deactivate. That will result in hot page to be reclaimed and other cgroup's
>>>> cold page will be left, which is not expected.
>>>>
>>>> The patch will not force deactivate when inactive_is_low is not true unless
>>>> we has scanned the inactive list and memory is unable to reclaim.
>>> Do you have any data to present?
>> I write an testcase that cgroup B has a lot of hot pagecache and cgroup C
>> is full of cold pagecache.  and
>>
>> their parent cgroup A will trigger the reclaim due of it's limit has been
>> breached.
>>
>>
>> The testcase should assume that we should not reclaim the  hot pagecache in
>> cgroup B because C has
>>
>> plenty cold pagecache.   Unfortunately,  I can see cgroup B hot pagecache
>> has been decreased when
>>
>> cgroup A trigger the reclaim.
> Thank you, this is more or less what've expected from your initial
> description. An extended form would be preferable for the changelog to
> make the setup more clear. But you still haven't quantified the effect.
> It would be also good to describe what is the effect on the workload
> described by 53138cea7f39 ("mm: vmscan: move file exhaustion detection
> to the node level"). A more extended description on the implementation
> would be nice as well.

Hi,  Michal

I'm sorry for lately reply due of a long vacation.  But that indeed 
breach the initial purpose.

Do you think the patch make sense, or any benchmark can be recommended 
to get some data.

Thank,

zhing jiang
Michal Hocko Oct. 14, 2020, 2:47 p.m. UTC | #5
On Wed 14-10-20 22:21:58, zhong jiang wrote:
> 
> On 2020/9/30 12:02 上午, Michal Hocko wrote:
> > On Sun 27-09-20 10:39:22, zhong jiang wrote:
> > > On 2020/9/25 8:07 下午, Michal Hocko wrote:
> > > > On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
> > > > > After appling the series patches(mm: fix page aging across multiple cgroups),
> > > > > cgroup memory reclaim strategy is based on reclaim root's inactive:active
> > > > > ratio. if the target lruvec need to deactivate, its children cgroup also will
> > > > > deactivate. That will result in hot page to be reclaimed and other cgroup's
> > > > > cold page will be left, which is not expected.
> > > > > 
> > > > > The patch will not force deactivate when inactive_is_low is not true unless
> > > > > we has scanned the inactive list and memory is unable to reclaim.
> > > > Do you have any data to present?
> > > I write an testcase that cgroup B has a lot of hot pagecache and cgroup C
> > > is full of cold pagecache.  and
> > > 
> > > their parent cgroup A will trigger the reclaim due of it's limit has been
> > > breached.
> > > 
> > > 
> > > The testcase should assume that we should not reclaim the  hot pagecache in
> > > cgroup B because C has
> > > 
> > > plenty cold pagecache.   Unfortunately,  I can see cgroup B hot pagecache
> > > has been decreased when
> > > 
> > > cgroup A trigger the reclaim.
> > Thank you, this is more or less what've expected from your initial
> > description. An extended form would be preferable for the changelog to
> > make the setup more clear. But you still haven't quantified the effect.
> > It would be also good to describe what is the effect on the workload
> > described by 53138cea7f39 ("mm: vmscan: move file exhaustion detection
> > to the node level"). A more extended description on the implementation
> > would be nice as well.
> 
> Hi,  Michal
> 
> I'm sorry for lately reply due of a long vacation.  But that indeed breach
> the initial purpose.
> 
> Do you think the patch make sense, or any benchmark can be recommended to
> get some data.

To be honest I have really hard time to grasp what is the effect of this
patch. Also memory reclaim tuning without any data doesn't really sound
convincing. Do you see any real workload that is benefiting from this
change or this is mostly a based on reading the code and a theoretical
concern?

Please understand that the existing balancing is quite complex and any
changes should be carfully analyzed and described.
zhong jiang Oct. 14, 2020, 4:57 p.m. UTC | #6
On 2020/10/14 10:47 下午, Michal Hocko wrote:
> On Wed 14-10-20 22:21:58, zhong jiang wrote:
>> On 2020/9/30 12:02 上午, Michal Hocko wrote:
>>> On Sun 27-09-20 10:39:22, zhong jiang wrote:
>>>> On 2020/9/25 8:07 下午, Michal Hocko wrote:
>>>>> On Fri 25-09-20 19:49:12, zhongjiang-ali wrote:
>>>>>> After appling the series patches(mm: fix page aging across multiple cgroups),
>>>>>> cgroup memory reclaim strategy is based on reclaim root's inactive:active
>>>>>> ratio. if the target lruvec need to deactivate, its children cgroup also will
>>>>>> deactivate. That will result in hot page to be reclaimed and other cgroup's
>>>>>> cold page will be left, which is not expected.
>>>>>>
>>>>>> The patch will not force deactivate when inactive_is_low is not true unless
>>>>>> we has scanned the inactive list and memory is unable to reclaim.
>>>>> Do you have any data to present?
>>>> I write an testcase that cgroup B has a lot of hot pagecache and cgroup C
>>>> is full of cold pagecache.  and
>>>>
>>>> their parent cgroup A will trigger the reclaim due of it's limit has been
>>>> breached.
>>>>
>>>>
>>>> The testcase should assume that we should not reclaim the  hot pagecache in
>>>> cgroup B because C has
>>>>
>>>> plenty cold pagecache.   Unfortunately,  I can see cgroup B hot pagecache
>>>> has been decreased when
>>>>
>>>> cgroup A trigger the reclaim.
>>> Thank you, this is more or less what've expected from your initial
>>> description. An extended form would be preferable for the changelog to
>>> make the setup more clear. But you still haven't quantified the effect.
>>> It would be also good to describe what is the effect on the workload
>>> described by 53138cea7f39 ("mm: vmscan: move file exhaustion detection
>>> to the node level"). A more extended description on the implementation
>>> would be nice as well.
>> Hi,  Michal
>>
>> I'm sorry for lately reply due of a long vacation.  But that indeed breach
>> the initial purpose.
>>
>> Do you think the patch make sense, or any benchmark can be recommended to
>> get some data.
> To be honest I have really hard time to grasp what is the effect of this
> patch. Also memory reclaim tuning without any data doesn't really sound
> convincing. Do you see any real workload that is benefiting from this
> change or this is mostly a based on reading the code and a theoretical
> concern?

I write an testcase based on LTP to test the condition  and see active 
page will be deactived but

child cgroup still has a lot of cold memory, which is not unexpected.

The testcase is attached.

> Please understand that the existing balancing is quite complex and any
> changes should be carfully analyzed and described.
>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/sysinfo.h>
#include <sys/wait.h>

#include "tst_test.h"

#define HOT_CACHE	(1024 * 1024 * 500UL)
#define COLD_CACHE	(1024 * 1024 * 1024 * 2UL)
#define MEMORY_MAX	(COLD_CACHE / 2)
#define HOT_MAX		(1024 * 1024 * 400UL)
#define MEMCG_ROOT 	"/sys/fs/cgroup/memory"
#define MEMCG_CHILD_HOT		"memcg_test_0"
#define MEMCG_CHILD_COLD	"memcg_test_1"
#define tempfile_hot	"memcg_test.hot"
#define tempfile_cold	"memcg_test.cold"
#define TEST_PASS	0
#define TEST_FAIL	-1

static inline int values_close(long a, long b, int err)
{
    return abs(a - b) <= (a + b) / 100 * err;
}

static int cg_create(char *path, char *parent, char *memcg)
{
	sprintf(path, "%s/%s", parent, memcg);

	return mkdir(path, 0644);
}

static int cg_write(char *memcg, char *control, char *buf, ssize_t len)
{
	char path[PATH_MAX];
	int fd;

	sprintf(path, "%s/%s", memcg, control);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return fd;
	len = write(fd, buf, len);
	close(fd);

	return len;
}

static int cg_move(char *cgroup)
{
	char pid_name[10] = {0,};
	char buf[PATH_MAX];

	strcpy(buf, cgroup);
	sprintf(pid_name, "%d", getpid());

	return cg_write(dirname(buf), "cgroup.procs", pid_name, strlen(pid_name));
}

static int cg_destroy(char *parent, char *cgroup)
{
	char path[PATH_MAX];
	int ret;
	int loops = 1;

	if (!cgroup)
		sprintf(path, "%s", parent);
	else
		sprintf(path, "%s/%s", parent, cgroup);
retry:
	ret = rmdir(path);
	if (ret && errno == EBUSY && loops) {
		ret = cg_move(path);
		if (ret < 0)
			goto out;
		else {
			loops--;
			goto retry;
		}
	} else if (ret && errno == ENOENT)
		ret = 0;
out:
	return ret;
}

static int open_temp_file(char *name)
{
    return open(name, O_CREAT | O_RDWR | O_TRUNC, S_IRWXU);
}

static void remove_temp_file(char *name)
{
    unlink(name);
}

static void cleanup(void)
{
	char parent[PATH_MAX];

	if (!access(tempfile_hot, F_OK))
		remove_temp_file(tempfile_hot);
	if (!access(tempfile_cold, F_OK))
		remove_temp_file(tempfile_cold);

	sprintf(parent, "%s/%s", MEMCG_ROOT, "memcg_test");
	if (!access(parent, F_OK)) {
		cg_destroy(parent, MEMCG_CHILD_HOT);
		cg_destroy(parent, MEMCG_CHILD_COLD);
		cg_destroy(parent, NULL);
	}
}

static int cg_write_max(char *memcg, unsigned long max)
{
	char buf[100];

	sprintf(buf, "%ld", max);
	return cg_write(memcg, "memory.limit_in_bytes", buf, 100);
}

static void remove_temp_files(void)
{
	remove_temp_file(tempfile_hot);
	remove_temp_file(tempfile_cold);
}

static int read_file(int fd, unsigned long size, char *buf, int pagesize)
{
	unsigned long i;
	int len = 0;

	lseek(fd, 0, SEEK_SET);
	for (i = 0; i < size; i += pagesize) {
		len = read(fd, buf, pagesize);
		if (len < 0)
			break;
	}

	return len;
}

static void memcg_active_file(char *path, int fd, unsigned long size)
{
	char *buf;
	char *value;
	int pagesize = getpagesize();

	value = malloc(pagesize);
	buf = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	memset(buf, 'A', size);

	/* trigger memory reclaim to putback inactive page to active list */
	cg_write_max(path, HOT_MAX);
	munmap(buf, size);
	/* mark_page_accessed will make the page to be activated */
	read_file(fd, size, value, pagesize);
	free(value);
}

static int touch_pagecache(int fd, unsigned long size)
{
	char *buf;
	int pagesize = getpagesize();
	int len = -1;

	buf = malloc(pagesize);
	if (ftruncate(fd, size))
		goto out;

	len = read_file(fd, size, buf, pagesize);
out:
	free(buf);
	return len;
}
 
static int cg_read_stat(char *memcg, char *control, ssize_t *value)
{
	char path[PATH_MAX];
	char str[1024], temp[20];
	int found = 0;
	FILE *fp;

	sprintf(path, "%s/%s", memcg, "memory.stat");
	fp = fopen(path, "r");
	if (!fp)
		goto out;

	while (fgets(str, 1024, fp) != NULL) {
		if (!strncmp(str, control, strlen(control))) {
			sscanf(str, "%s %ld", temp, value);
			found = 1;
			break;
		} else
			continue;
	}

	fclose(fp);
	if (found)
		return 0;
out:
	return -1;
}

static int cg_write_conf(char *path, char *parent, char *memcg)
{
	char pid[10];

	if (cg_create(path, parent, memcg))
		goto out;

	sprintf(pid, "%d", getpid());
	cg_write(path, "cgroup.procs", pid, strlen(pid));

	return 0;
out:
	return -1;
}

static int memcg_create_cold(char *parent, char *memcg)
{
	int fd, ret;
	char path[PATH_MAX] = {0};
	pid_t pid;

	pid = fork();
	if (pid > 0) {
		if (wait(NULL) < 0)
			return -1;
		else
			return 0;
	} else if (pid == 0) {
		ret = cg_write_conf(path, parent, memcg);
		if (ret < 0)
			goto out;

		fd = open_temp_file(tempfile_cold);
		if (fd < 0)
			goto out_destroy;

		if (touch_pagecache(fd, COLD_CACHE) < 0)
			goto out_fd;

		close(fd);
		exit(0);
	} else
		return -1;

out_fd:
	close(fd);
	remove_temp_file(tempfile_cold);
out_destroy:
	cg_destroy(path, NULL);
out:
	exit(-1);
}

static int memcg_create_hot(char *parent, char *memcg)
{
	int ret, fd;
	char path[PATH_MAX];
	pid_t pid;

	pid = fork();
	if (pid > 0) {
		if (wait(NULL) < 0)
			return -1;
		else
			return 0;
	} else if (pid == 0) {
		ret = cg_write_conf(path, parent, memcg);
		if (ret < 0)
			goto out;

		fd = open_temp_file(tempfile_hot);
		if (fd < 0)
			goto out_destroy;

		if (touch_pagecache(fd, HOT_CACHE) < 0)
			goto out_fd;
		memcg_active_file(path, fd, HOT_CACHE);

		close(fd);
		exit(0);
	} else
		return -1;

out_fd:
	close(fd);
	remove_temp_file(tempfile_hot);
out_destroy:
	cg_destroy(path, NULL);
out:
	exit(-1);
}

static void memcg_get_active_pages(char *parent, char *memcg, long *active)
{
	char path[PATH_MAX];

	sprintf(path, "%s/%s", parent, memcg);
	cg_read_stat(path, "active_file", active);
}

static int memcg_measure_compare(char *parent, char *memcg, ssize_t size)
{
	long active;

	memcg_get_active_pages(parent, memcg, &active);
	if (values_close(active, size, 3))
		return 0;
	else
		return -1;
}

static void memcg_balance_run(void)
{
	char parent[PATH_MAX];
	int ret;
	long hot_active;

	if (access(MEMCG_ROOT, F_OK))
		tst_brk(TCONF, "memory has not been mounted");

	ret = cg_write_conf(parent, MEMCG_ROOT, "memcg_test");
	if (ret)
		goto out;

	ret = memcg_create_hot(parent, MEMCG_CHILD_HOT);
	if (ret)
		goto out_parent;

	memcg_get_active_pages(parent, MEMCG_CHILD_HOT, &hot_active);
	ret = memcg_create_cold(parent, MEMCG_CHILD_COLD);
	if (ret)
		goto out_child_hot;

	/* set parent memcg's limit_in_bytes to trigger memory reclaim. */
	ret = cg_write_max(parent, MEMORY_MAX);
	if (ret < 0)
		goto out_parent;

	/* Assume that we should not less that the specified value within a certain deviation */
	ret = memcg_measure_compare(parent, MEMCG_CHILD_HOT, hot_active);
	if (ret)
		goto out_child_cold;

	remove_temp_files();

out_child_cold:
	cg_destroy(parent, MEMCG_CHILD_COLD);
out_child_hot:
	cg_destroy(parent, MEMCG_CHILD_HOT);
out_parent:
	cg_destroy(parent, NULL);
out:
	if (ret == TEST_FAIL)
		tst_res(TFAIL, "memcg balance");
	else
		tst_res(TPASS, "memcg balance");
}

static struct tst_test test = {
	.test_all = memcg_balance_run,
	.min_kver = "4.19",
	.needs_root = 1,
	.cleanup = cleanup,
	.timeout = 3600,
};
diff mbox series

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 466fc31..77d395f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2407,8 +2407,21 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		case SCAN_FILE:
 		case SCAN_ANON:
 			/* Scan one type exclusively */
-			if ((scan_balance == SCAN_FILE) != file)
+			if ((scan_balance == SCAN_FILE) != file) {
 				scan = 0;
+			} else {
+				/*
+				 * Reclaim memory is based on the root's inactive: active
+				 * ratio, but it is possible that silbing cgroup has a lot
+				 * of cold memory to reclaim rather than reclaim the hot
+				 * cache in the current cgroup.
+				 */
+				if (!sc->force_deactivate && is_active_lru(lru) &&
+						!inactive_is_low(lruvec, lru - LRU_ACTIVE)) {
+					if (sc->may_deactivate & (1 << file))
+						scan = 0;
+				}
+			}
 			break;
 		default:
 			/* Look ma, no brain */