[v5,3/4] mm/page_owner: Print memcg information

Message ID 20220208000532.1054311-4-longman@redhat.com (mailing list archive)
State New
Series mm/page_owner: Extend page_owner to show memcg information

Commit Message

Waiman Long Feb. 8, 2022, 12:05 a.m. UTC
It was found that a number of dying memcgs were not freed because
they were still pinned by charged pages. Even "echo 1 >
/proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
but not freed memcgs tend to increase in number over time, with the side
effect that percpu memory consumption, as shown in /proc/meminfo, also
increases over time.

In order to find out more information about those pages that pin
dying memcgs, the page_owner feature is extended to print memory
cgroup information, especially whether the cgroup is dying or not.
The RCU read lock is taken while the memcg is being accessed to make
sure that it won't be freed.
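
For illustration (the cgroup name "pod-1234" is made up), a kernel
page charged to a dying memcg via an objcg would gain an output line
like:

	Charged (via objcg) to dying memcg pod-1234

while a slab page is tagged with a "Slab cache page" line instead.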

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

Comments

Michal Hocko Feb. 8, 2022, 12:13 p.m. UTC | #1
On Mon 07-02-22 19:05:31, Waiman Long wrote:
> It was found that a number of dying memcgs were not freed because
> they were still pinned by charged pages. Even "echo 1 >
> /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
> but not freed memcgs tend to increase in number over time, with the side
> effect that percpu memory consumption, as shown in /proc/meminfo, also
> increases over time.

I still believe that this is a very suboptimal way to debug offline memcgs,
but memcg information can be useful in other contexts and it doesn't
cost us anything except for some additional output, so I am fine with this.
 
> In order to find out more information about those pages that pin
> dying memcgs, the page_owner feature is extended to print memory
> cgroup information, especially whether the cgroup is dying or not.
> The RCU read lock is taken while the memcg is being accessed to make
> sure that it won't be freed.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>

With a few comments/questions below.

> ---
>  mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..d4c311455753 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>  
>  #include "internal.h"
> @@ -325,6 +326,47 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  	seq_putc(m, '\n');
>  }
>  
> +/*
> + * Looking for memcg information and print it out
> + */

I am not sure this is a particularly useful comment.

> +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
> +					 struct page *page)
> +{
> +#ifdef CONFIG_MEMCG
> +	unsigned long memcg_data;
> +	struct mem_cgroup *memcg;
> +	bool dying;
> +
> +	rcu_read_lock();
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (!memcg_data)
> +		goto out_unlock;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Slab cache page\n");
> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		goto out_unlock;
> +
> +	dying = (memcg->css.flags & CSS_DYING);

Is there any specific reason why you haven't used mem_cgroup_online?

> +	ret += scnprintf(kbuf + ret, count - ret,
> +			"Charged %sto %smemcg ",
> +			PageMemcgKmem(page) ? "(via objcg) " : "",
> +			dying ? "dying " : "");
> +
> +	/* Write cgroup name directly into kbuf */
> +	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
> +	ret += strlen(kbuf + ret);

cgroup_name should return the length of the path added to the buffer.

> +	ret += scnprintf(kbuf + ret, count - ret, "\n");

I do not see any overflow prevention here. I believe you really need to
check ret >= count after each scnprintf/cgroup_name.
Michal Hocko Feb. 8, 2022, 3:15 p.m. UTC | #2
On Tue 08-02-22 13:13:15, Michal Hocko wrote:
> On Mon 07-02-22 19:05:31, Waiman Long wrote:
[...]
> > +	ret += scnprintf(kbuf + ret, count - ret, "\n");
> 
> I do not see any overflow prevention here. I believe you really need to
> check ret >= count after each scnprintf/cgroup_name.

OK, I have only now realized that scnprintf() would return size - 1 on
overflow, so explicitly checking for the overflow shouldn't really be
necessary.
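
For reference, a minimal sketch (not from the patch) of the difference:

	char buf[8];

	/*
	 * snprintf() returns the length that *would* have been written
	 * (10 here), so accumulating its return value can push "ret"
	 * past "count" and make "count - ret" underflow.
	 */
	int would_write = snprintf(buf, sizeof(buf), "0123456789");

	/*
	 * scnprintf() returns what was actually stored, excluding the
	 * terminating NUL (7 here, i.e. at most size - 1), so an
	 * accumulated "ret" can never exceed "count - 1".
	 */
	int did_write = scnprintf(buf, sizeof(buf), "0123456789");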
Waiman Long Feb. 8, 2022, 6:40 p.m. UTC | #3
On 2/8/22 07:13, Michal Hocko wrote:
> On Mon 07-02-22 19:05:31, Waiman Long wrote:
>> It was found that a number of dying memcgs were not freed because
>> they were still pinned by charged pages. Even "echo 1 >
>> /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
>> but not freed memcgs tend to increase in number over time, with the side
>> effect that percpu memory consumption, as shown in /proc/meminfo, also
>> increases over time.
> I still believe that this is a very suboptimal way to debug offline memcgs,
> but memcg information can be useful in other contexts and it doesn't
> cost us anything except for some additional output, so I am fine with this.
I am planning to have a follow-up patch to add a new debugfs file for
printing page information associated with dying memcgs only. It will be
based on the existing page_owner code, though, so I need to get this
patch in first.
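
A rough sketch of that follow-up, modeled on page_owner's existing
debugfs setup (the file name and the filtered read helper below are
hypothetical; nothing like this has been posted yet):

	/*
	 * Hypothetical: a second debugfs entry that reuses the
	 * page_owner walk but only reports pages charged to dying
	 * memcgs.  read_page_owner_memcg() would be a filtered
	 * variant of the existing read_page_owner().
	 */
	static const struct file_operations proc_page_owner_memcg_ops = {
		.read = read_page_owner_memcg,
	};

	static int __init pageowner_memcg_init(void)
	{
		debugfs_create_file("page_owner_memcg", 0400, NULL, NULL,
				    &proc_page_owner_memcg_ops);
		return 0;
	}
	late_initcall(pageowner_memcg_init);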
>   
>> In order to find out more information about those pages that pin
>> dying memcgs, the page_owner feature is extended to print memory
>> cgroup information, especially whether the cgroup is dying or not.
>> The RCU read lock is taken while the memcg is being accessed to make
>> sure that it won't be freed.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> Acked-by: Roman Gushchin <guro@fb.com>
>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> With a few comments/questions below.
>
>> ---
>>   mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 44 insertions(+)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index 28dac73e0542..d4c311455753 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/migrate.h>
>>   #include <linux/stackdepot.h>
>>   #include <linux/seq_file.h>
>> +#include <linux/memcontrol.h>
>>   #include <linux/sched/clock.h>
>>   
>>   #include "internal.h"
>> @@ -325,6 +326,47 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>   	seq_putc(m, '\n');
>>   }
>>   
>> +/*
>> + * Looking for memcg information and print it out
>> + */
> I am not sure this is a particularly useful comment.
Right, I can remove that.
>
>> +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
>> +					 struct page *page)
>> +{
>> +#ifdef CONFIG_MEMCG
>> +	unsigned long memcg_data;
>> +	struct mem_cgroup *memcg;
>> +	bool dying;
>> +
>> +	rcu_read_lock();
>> +	memcg_data = READ_ONCE(page->memcg_data);
>> +	if (!memcg_data)
>> +		goto out_unlock;
>> +
>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>> +		ret += scnprintf(kbuf + ret, count - ret,
>> +				"Slab cache page\n");
>> +
>> +	memcg = page_memcg_check(page);
>> +	if (!memcg)
>> +		goto out_unlock;
>> +
>> +	dying = (memcg->css.flags & CSS_DYING);
> Is there any specific reason why you haven't used mem_cgroup_online?
Not really. However, I think checking for CSS_DYING makes more sense now 
that I am using the term "dying".
>
>> +	ret += scnprintf(kbuf + ret, count - ret,
>> +			"Charged %sto %smemcg ",
>> +			PageMemcgKmem(page) ? "(via objcg) " : "",
>> +			dying ? "dying " : "");
>> +
>> +	/* Write cgroup name directly into kbuf */
>> +	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
>> +	ret += strlen(kbuf + ret);
> cgroup_name should return the length of the path added to the buffer.
I realized that after I sent out the patch. I will remove the redundant
strlen() in a future update.
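
A sketch of that cleanup, assuming cgroup_name() keeps its
strlcpy()-style contract of returning the length of the full source
name even when the copy is truncated:

	/*
	 * Use cgroup_name()'s return value instead of a redundant
	 * strlen().  The returned length can exceed the space that was
	 * actually available, so clamp it before advancing "ret" to
	 * keep the later "count - ret" arithmetic sane.
	 */
	int len = cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);

	ret += min_t(int, len, (int)(count - ret) - 1);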
>
>> +	ret += scnprintf(kbuf + ret, count - ret, "\n");
> I do not see any overflow prevention here. I believe you really need to
> check ret >= count after each scnprintf/cgroup_name.

As you have realized, the beauty of using scnprintf() is not needing
an overflow check after each invocation.

Cheers,
Longman
Michal Hocko Feb. 8, 2022, 7:11 p.m. UTC | #4
On Tue 08-02-22 13:40:57, Waiman Long wrote:
> On 2/8/22 07:13, Michal Hocko wrote:
> > On Mon 07-02-22 19:05:31, Waiman Long wrote:
> > > It was found that a number of dying memcgs were not freed because
> > > they were still pinned by charged pages. Even "echo 1 >
> > > /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
> > > but not freed memcgs tend to increase in number over time, with the side
> > > effect that percpu memory consumption, as shown in /proc/meminfo, also
> > > increases over time.
> > I still believe that this is a very suboptimal way to debug offline memcgs,
> > but memcg information can be useful in other contexts and it doesn't
> > cost us anything except for some additional output, so I am fine with this.
>
> I am planning to have a follow-up patch to add a new debugfs file for
> printing page information associated with dying memcgs only. It will be
> based on the existing page_owner code, though, so I need to get this
> patch in first.

Sure. I would give the drgn approach a shot, as it can be much more
versatile without any additional kernel code.

[...]

> > > +	dying = (memcg->css.flags & CSS_DYING);
> > Is there any specific reason why you haven't used mem_cgroup_online?
> Not really. However, I think checking for CSS_DYING makes more sense now
> that I am using the term "dying".

I do not really care much, but I thought CSS_DYING was a cgroup-internal
thing. We have a high-level API, so I thought it would be preferable to
use it.
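
For reference, a minimal sketch of that variant (note that
mem_cgroup_online() tests CSS_ONLINE rather than CSS_DYING, so the
reported state would shift from "dying" to "offline"; the two flags
flip at slightly different points during cgroup teardown):

	/*
	 * High-level-API variant: report offline rather than dying
	 * memcgs.  Not bit-for-bit equivalent to the CSS_DYING test
	 * in the posted patch.
	 */
	bool online = mem_cgroup_online(memcg);

	ret += scnprintf(kbuf + ret, count - ret,
			 "Charged %sto %smemcg ",
			 PageMemcgKmem(page) ? "(via objcg) " : "",
			 online ? "" : "offline ");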

Patch

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..d4c311455753 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@ 
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>
 
 #include "internal.h"
@@ -325,6 +326,47 @@  void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 	seq_putc(m, '\n');
 }
 
+/*
+ * Looking for memcg information and print it out
+ */
+static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
+					 struct page *page)
+{
+#ifdef CONFIG_MEMCG
+	unsigned long memcg_data;
+	struct mem_cgroup *memcg;
+	bool dying;
+
+	rcu_read_lock();
+	memcg_data = READ_ONCE(page->memcg_data);
+	if (!memcg_data)
+		goto out_unlock;
+
+	if (memcg_data & MEMCG_DATA_OBJCGS)
+		ret += scnprintf(kbuf + ret, count - ret,
+				"Slab cache page\n");
+
+	memcg = page_memcg_check(page);
+	if (!memcg)
+		goto out_unlock;
+
+	dying = (memcg->css.flags & CSS_DYING);
+	ret += scnprintf(kbuf + ret, count - ret,
+			"Charged %sto %smemcg ",
+			PageMemcgKmem(page) ? "(via objcg) " : "",
+			dying ? "dying " : "");
+
+	/* Write cgroup name directly into kbuf */
+	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
+	ret += strlen(kbuf + ret);
+	ret += scnprintf(kbuf + ret, count - ret, "\n");
+out_unlock:
+	rcu_read_unlock();
+#endif /* CONFIG_MEMCG */
+
+	return ret;
+}
+
 static ssize_t
 print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		struct page *page, struct page_owner *page_owner,
@@ -365,6 +407,8 @@  print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migrate_reason_names[page_owner->last_migrate_reason]);
 	}
 
+	ret = print_page_owner_memcg(kbuf, count, ret, page);
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;