diff mbox series

mm: Move check for SHRINKER_NUMA_AWARE to do_shrink_slab()

Message ID 153320759911.18959.8842396230157677671.stgit@localhost.localdomain (mailing list archive)
State New, archived
Headers show
Series mm: Move check for SHRINKER_NUMA_AWARE to do_shrink_slab() | expand

Commit Message

Kirill Tkhai Aug. 2, 2018, 11 a.m. UTC
In case of shrink_slab_memcg() we do not zero nid, when shrinker
is not numa-aware. This is not a real problem, since currently
all memcg-aware shrinkers are numa-aware too (we have two:
super_block shrinker and workingset shrinker), but something may
change in the future.

(Andrew, this may be merged to mm-iterate-only-over-charged-shrinkers-during-memcg-shrink_slab)

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 mm/vmscan.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Yang Shi Aug. 2, 2018, 4:47 p.m. UTC | #1
On Thu, Aug 2, 2018 at 4:00 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> In case of shrink_slab_memcg() we do not zero nid, when shrinker
> is not numa-aware. This is not a real problem, since currently
> all memcg-aware shrinkers are numa-aware too (we have two:

Actually, this is not true. huge_zero_page_shrinker is NOT numa-aware.
deferred_split_shrinker is numa-aware.

Thanks,
Yang


> super_block shrinker and workingset shrinker), but something may
> change in the future.
>
> (Andrew, this may be merged to mm-iterate-only-over-charged-shrinkers-during-memcg-shrink_slab)
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  mm/vmscan.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ea0a46166e8e..0d980e801b8a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -455,6 +455,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>                                           : SHRINK_BATCH;
>         long scanned = 0, next_deferred;
>
> +       if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +               nid = 0;
> +
>         freeable = shrinker->count_objects(shrinker, shrinkctl);
>         if (freeable == 0 || freeable == SHRINK_EMPTY)
>                 return freeable;
> @@ -680,9 +683,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                         .memcg = memcg,
>                 };
>
> -               if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> -                       sc.nid = 0;
> -
>                 ret = do_shrink_slab(&sc, shrinker, priority);
>                 if (ret == SHRINK_EMPTY)
>                         ret = 0;
>
Shakeel Butt Aug. 2, 2018, 4:54 p.m. UTC | #2
On Thu, Aug 2, 2018 at 9:47 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 2, 2018 at 4:00 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> > In case of shrink_slab_memcg() we do not zero nid, when shrinker
> > is not numa-aware. This is not a real problem, since currently
> > all memcg-aware shrinkers are numa-aware too (we have two:
>
> Actually, this is not true. huge_zero_page_shrinker is NOT numa-aware.
> deferred_split_shrinker is numa-aware.
>

But both huge_zero_page_shrinker and huge_zero_page_shrinker are not
memcg-aware shrinkers. I think Kirill is saying all memcg-aware
shrinkers are also numa-aware shrinkers.

Shakeel
Yang Shi Aug. 2, 2018, 5:26 p.m. UTC | #3
On Thu, Aug 2, 2018 at 9:54 AM, Shakeel Butt <shakeelb@google.com> wrote:
> On Thu, Aug 2, 2018 at 9:47 AM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Thu, Aug 2, 2018 at 4:00 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> > In case of shrink_slab_memcg() we do not zero nid, when shrinker
>> > is not numa-aware. This is not a real problem, since currently
>> > all memcg-aware shrinkers are numa-aware too (we have two:
>>
>> Actually, this is not true. huge_zero_page_shrinker is NOT numa-aware.
>> deferred_split_shrinker is numa-aware.
>>
>
> But both huge_zero_page_shrinker and huge_zero_page_shrinker are not
> memcg-aware shrinkers. I think Kirill is saying all memcg-aware
> shrinkers are also numa-aware shrinkers.

Aha, thanks for reminding. Yes, I missed that memcg-aware part.

>
> Shakeel
Andrew Morton Aug. 2, 2018, 8:47 p.m. UTC | #4
On Thu, 02 Aug 2018 14:00:52 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> In case of shrink_slab_memcg() we do not zero nid, when shrinker
> is not numa-aware. This is not a real problem, since currently
> all memcg-aware shrinkers are numa-aware too (we have two:
> super_block shrinker and workingset shrinker), but something may
> change in the future.

Fair enough.

> (Andrew, this may be merged to mm-iterate-only-over-charged-shrinkers-during-memcg-shrink_slab)

It got a bit messy so I got lazy and queued it as a separate patch.

btw, I have a note that https://lkml.org/lkml/2018/7/7/32 was caused by
this patch series.  Is that the case and do you know if this was
addressed?
Kirill Tkhai Aug. 3, 2018, 7:11 a.m. UTC | #5
On 02.08.2018 20:26, Yang Shi wrote:
> On Thu, Aug 2, 2018 at 9:54 AM, Shakeel Butt <shakeelb@google.com> wrote:
>> On Thu, Aug 2, 2018 at 9:47 AM Yang Shi <shy828301@gmail.com> wrote:
>>>
>>> On Thu, Aug 2, 2018 at 4:00 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> In case of shrink_slab_memcg() we do not zero nid, when shrinker
>>>> is not numa-aware. This is not a real problem, since currently
>>>> all memcg-aware shrinkers are numa-aware too (we have two:
>>>
>>> Actually, this is not true. huge_zero_page_shrinker is NOT numa-aware.
>>> deferred_split_shrinker is numa-aware.
>>>
>>
>> But both huge_zero_page_shrinker and huge_zero_page_shrinker are not
>> memcg-aware shrinkers. I think Kirill is saying all memcg-aware
>> shrinkers are also numa-aware shrinkers.
> 
> Aha, thanks for reminding. Yes, I missed that memcg-aware part.

Yes, I mean workingset_shadow_shrinker.
Kirill Tkhai Aug. 3, 2018, 9:02 a.m. UTC | #6
On 02.08.2018 23:47, Andrew Morton wrote:
> On Thu, 02 Aug 2018 14:00:52 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
>> In case of shrink_slab_memcg() we do not zero nid, when shrinker
>> is not numa-aware. This is not a real problem, since currently
>> all memcg-aware shrinkers are numa-aware too (we have two:
>> super_block shrinker and workingset shrinker), but something may
>> change in the future.
> 
> Fair enough.
> 
>> (Andrew, this may be merged to mm-iterate-only-over-charged-shrinkers-during-memcg-shrink_slab)
> 
> It got a bit messy so I got lazy and queued it as a separate patch.
> 
> btw, I have a note that https://lkml.org/lkml/2018/7/7/32 was caused by
> this patch series.  Is that the case and do you know if this was
> addressed?

It's not related to the patchset. Bisect leads to:

commit c6aeb9d4c351 (HEAD, refs/bisect/bad)
Author: David Howells <dhowells@redhat.com>
Date:   Sun Jun 24 00:20:10 2018 +0100

kernfs, sysfs, cgroup, intel_rdt: Support fs_context

CC David.

David, please see reproducer at https://lkml.org/lkml/2018/7/7/32

Kirill
David Howells Aug. 3, 2018, 10:31 a.m. UTC | #7
The reproducer can be reduced to:

	#define _GNU_SOURCE
	#include <endian.h>
	#include <stdint.h>
	#include <string.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <sys/stat.h>
	#include <sys/mount.h>
	#include <unistd.h>
	#include <fcntl.h>

	const char path[] = "./file0";

	int main()
	{
		mkdir(path, 0);
		mount(path, path, "cgroup2", 0, 0);
		chroot(path);
		umount2(path, 0);
		return 0;
	}

and I've found two bugs (see attached patch).  The issue is that
do_remount_sb() is called with fc == NULL from umount(), but both
cgroup_reconfigure() and do_remount_sb() dereference fc unconditionally.

But!  I'm not sure why the reproducer works at all because the umount2() call
is *after* the chroot, so should fail on ENOENT before it even gets that far.
In fact, umount2() can be called multiple times, apparently successfully, and
doesn't actually unmount anything.

David
---
diff --git a/fs/super.c b/fs/super.c
index 3fe5d12b7697..321fbc244570 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -978,7 +978,10 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
 	    sb->s_op->remount_fs) {
 		if (sb->s_op->reconfigure) {
 			retval = sb->s_op->reconfigure(sb, fc);
-			sb_flags = fc->sb_flags;
+			if (fc)
+				sb_flags = fc->sb_flags;
+			else
+				sb_flags = sb->s_flags;
 			if (retval == 0)
 				security_sb_reconfigure(fc);
 		} else {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f3238f38d152..48275fdce053 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1796,9 +1796,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 
 static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
 {
-	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+	if (fc) {
+		struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
-	apply_cgroup_root_flags(ctx->flags);
+		apply_cgroup_root_flags(ctx->flags);
+	}
 	return 0;
 }
Kirill Tkhai Aug. 3, 2018, 10:59 a.m. UTC | #8
On 03.08.2018 13:31, David Howells wrote:
> The reproducer can be reduced to:
> 
> 	#define _GNU_SOURCE
> 	#include <endian.h>
> 	#include <stdint.h>
> 	#include <string.h>
> 	#include <stdio.h>
> 	#include <sys/syscall.h>
> 	#include <sys/stat.h>
> 	#include <sys/mount.h>
> 	#include <unistd.h>
> 	#include <fcntl.h>
> 
> 	const char path[] = "./file0";
> 
> 	int main()
> 	{
> 		mkdir(path, 0);
> 		mount(path, path, "cgroup2", 0, 0);
> 		chroot(path);
> 		umount2(path, 0);
> 		return 0;
> 	}
> 
> and I've found two bugs (see attached patch).  The issue is that
> do_remount_sb() is called with fc == NULL from umount(), but both
> cgroup_reconfigure() and do_remount_sb() dereference fc unconditionally.
>
> But!  I'm not sure why the reproducer works at all because the umount2() call
> is *after* the chroot, so should fail on ENOENT before it even gets that far.
> In fact, umount2() can be called multiple times, apparently successfully, and
> doesn't actually unmount anything.

Before I also try to check why it works; just reporting you that the patch
works the problem in my environment. Thanks, David.

> ---
> diff --git a/fs/super.c b/fs/super.c
> index 3fe5d12b7697..321fbc244570 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -978,7 +978,10 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
>  	    sb->s_op->remount_fs) {
>  		if (sb->s_op->reconfigure) {
>  			retval = sb->s_op->reconfigure(sb, fc);
> -			sb_flags = fc->sb_flags;
> +			if (fc)
> +				sb_flags = fc->sb_flags;
> +			else
> +				sb_flags = sb->s_flags;
>  			if (retval == 0)
>  				security_sb_reconfigure(fc);
>  		} else {
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index f3238f38d152..48275fdce053 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1796,9 +1796,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  
>  static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
>  {
> -	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> +	if (fc) {
> +		struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>  
> -	apply_cgroup_root_flags(ctx->flags);
> +		apply_cgroup_root_flags(ctx->flags);
> +	}
>  	return 0;
>  }
>  
>
Kirill Tkhai Aug. 3, 2018, 11:04 a.m. UTC | #9
On 03.08.2018 13:59, Kirill Tkhai wrote:
> On 03.08.2018 13:31, David Howells wrote:
>> The reproducer can be reduced to:
>>
>> 	#define _GNU_SOURCE
>> 	#include <endian.h>
>> 	#include <stdint.h>
>> 	#include <string.h>
>> 	#include <stdio.h>
>> 	#include <sys/syscall.h>
>> 	#include <sys/stat.h>
>> 	#include <sys/mount.h>
>> 	#include <unistd.h>
>> 	#include <fcntl.h>
>>
>> 	const char path[] = "./file0";
>>
>> 	int main()
>> 	{
>> 		mkdir(path, 0);
>> 		mount(path, path, "cgroup2", 0, 0);
>> 		chroot(path);
>> 		umount2(path, 0);
>> 		return 0;
>> 	}
>>
>> and I've found two bugs (see attached patch).  The issue is that
>> do_remount_sb() is called with fc == NULL from umount(), but both
>> cgroup_reconfigure() and do_remount_sb() dereference fc unconditionally.
>>
>> But!  I'm not sure why the reproducer works at all because the umount2() call
>> is *after* the chroot, so should fail on ENOENT before it even gets that far.
>> In fact, umount2() can be called multiple times, apparently successfully, and
>> doesn't actually unmount anything.
> 
> Before I also try to check why it works; just reporting you that the patch
> works the problem in my environment. Thanks, David.

patch *fixes* the problem

> 
>> ---
>> diff --git a/fs/super.c b/fs/super.c
>> index 3fe5d12b7697..321fbc244570 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -978,7 +978,10 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
>>  	    sb->s_op->remount_fs) {
>>  		if (sb->s_op->reconfigure) {
>>  			retval = sb->s_op->reconfigure(sb, fc);
>> -			sb_flags = fc->sb_flags;
>> +			if (fc)
>> +				sb_flags = fc->sb_flags;
>> +			else
>> +				sb_flags = sb->s_flags;
>>  			if (retval == 0)
>>  				security_sb_reconfigure(fc);
>>  		} else {
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index f3238f38d152..48275fdce053 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -1796,9 +1796,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>>  
>>  static int cgroup_reconfigure(struct kernfs_root *kf_root, struct fs_context *fc)
>>  {
>> -	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>> +	if (fc) {
>> +		struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>>  
>> -	apply_cgroup_root_flags(ctx->flags);
>> +		apply_cgroup_root_flags(ctx->flags);
>> +	}
>>  	return 0;
>>  }
>>  
>>
David Howells Aug. 3, 2018, 11:18 a.m. UTC | #10
David Howells <dhowells@redhat.com> wrote:

> But!  I'm not sure why the reproducer works at all because the umount2() call
> is *after* the chroot, so should fail on ENOENT before it even gets that
> far.

No, it shouldn't.  It did chroot() not chdir().

> In fact, umount2() can be called multiple times, apparently successfully, and
> doesn't actually unmount anything.

Okay, because it chroot'd into the directory.  Should it return EBUSY though?

David
David Howells Aug. 3, 2018, noon UTC | #11
Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> > Before I also try to check why it works; just reporting you that the patch
> > works the problem in my environment. Thanks, David.
> 
> patch *fixes* the problem

Thanks.  I've folded the patch in.

David
diff mbox series

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea0a46166e8e..0d980e801b8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -455,6 +455,9 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 					  : SHRINK_BATCH;
 	long scanned = 0, next_deferred;
 
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
 		return freeable;
@@ -680,9 +683,6 @@  static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 			.memcg = memcg,
 		};
 
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-			sc.nid = 0;
-
 		ret = do_shrink_slab(&sc, shrinker, priority);
 		if (ret == SHRINK_EMPTY)
 			ret = 0;