
[2/7] mm: shrinker: Add a .to_text() method for shrinkers

Message ID 20231122232515.177833-3-kent.overstreet@linux.dev (mailing list archive)
State New
Series: shrinker debugging improvements

Commit Message

Kent Overstreet Nov. 22, 2023, 11:25 p.m. UTC
This adds a new callback method to shrinkers which they can use to
describe anything relevant to memory reclaim about their internal state,
for example object dirtiness.

This patch also adds shrinkers_to_text(), which reports on the top 10
shrinkers - by object count - in sorted order, to be used in OOM
reporting.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/shrinker.h |  6 +++-
 mm/shrinker.c            | 73 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 2 deletions(-)
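
For illustration, a minimal sketch of how a shrinker might hook up the new
callback, based on the signature this patch adds; the my_cache structure, its
counters and the function names below are hypothetical and only show the
intended usage:

	/*
	 * Illustrative only: a shrinker-private structure with a couple of
	 * counters, reported through the new .to_text() callback.
	 */
	struct my_cache {
		atomic_long_t	nr_dirty;
		atomic_long_t	nr_io_in_flight;
	};

	static void my_cache_shrinker_to_text(struct seq_buf *out, struct shrinker *shrink)
	{
		struct my_cache *c = shrink->private_data;

		seq_buf_printf(out, "dirty:\t\t%ld\n",
			       atomic_long_read(&c->nr_dirty));
		seq_buf_printf(out, "io in flight:\t%ld\n",
			       atomic_long_read(&c->nr_io_in_flight));
	}

	/* after shrinker_alloc(): */
	shrinker->to_text = my_cache_shrinker_to_text;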

Comments

Kent Overstreet Nov. 23, 2023, 9:24 p.m. UTC | #1
On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
> > +	void (*to_text)(struct seq_buf *, struct shrinker *);
> 
> The "to_text" looks a little strange, how about naming it
> "stat_objects"?

The convention I've been using heavily in bcachefs is
typename_to_text(), or type.to_text(), for debug reports. The
consistency is nice.

> 
> >   	long batch;	/* reclaim batch size, 0 = default */
> >   	int seeks;	/* seeks to recreate an obj */
> > @@ -110,7 +114,6 @@ struct shrinker {
> >   #endif
> >   #ifdef CONFIG_SHRINKER_DEBUG
> >   	int debugfs_id;
> > -	const char *name;
> 
> The name will only be allocated memory when the CONFIG_SHRINKER_DEBUG is
> enabled, otherwise its value is NULL. So you can't move it out and use
> it while CONFIG_SHRINKER_DEBUG is disabled.

Good catch

> > +void shrinkers_to_text(struct seq_buf *out)
> > +{
> > +	struct shrinker *shrinker;
> > +	struct shrinker_by_mem {
> > +		struct shrinker	*shrinker;
> > +		unsigned long	mem;
> 
> The "mem" also looks a little strange, how about naming it
> "freeable"?

The type is just used in this one function, but sure.

> 
> > +	} shrinkers_by_mem[10];
> > +	int i, nr = 0;
> > +
> > +	if (!mutex_trylock(&shrinker_mutex)) {
> > +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> > +		return;
> > +	}
> 
> We now have lockless method (RCU + refcount) to iterate shrinker_list,
> please refer to shrink_slab().

Will do!
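
(For reference, a rough sketch of the lockless iteration pattern being
referred to, modeled on shrink_slab() after the lockless shrinker rework;
the shrinker_try_get()/shrinker_put() helpers are assumed from that series
and the exact form should be taken from whatever tree this lands on:)

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		/* sample shrinker->count_objects() into the top-10 array here */

		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();
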
Qi Zheng Nov. 24, 2023, 3:08 a.m. UTC | #2
Hi Kent,

On 2023/11/24 05:24, Kent Overstreet wrote:
> On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
>>> +	void (*to_text)(struct seq_buf *, struct shrinker *);
>>
>> The "to_text" looks a little strange, how about naming it
>> "stat_objects"?
> 
> The convention I've been using heavily in bcachefs is
> typename_to_text(), or type.to_text(), for debug reports. The

OK.

> consistency is nice.

However, this is inconsistent with the naming style of the other
shrinker callbacks. Please use the "objects" suffix. As for
bcachefs's own callback function, you can use typename_to_text()
to ensure consistency.

Thanks,
Qi

> 
>>
>>>    	long batch;	/* reclaim batch size, 0 = default */
>>>    	int seeks;	/* seeks to recreate an obj */
>>> @@ -110,7 +114,6 @@ struct shrinker {
>>>    #endif
>>>    #ifdef CONFIG_SHRINKER_DEBUG
>>>    	int debugfs_id;
>>> -	const char *name;
>>
>> The name will only be allocated memory when the CONFIG_SHRINKER_DEBUG is
>> enabled, otherwise its value is NULL. So you can't move it out and use
>> it while CONFIG_SHRINKER_DEBUG is disabled.
> 
> Good catch
> 
>>> +void shrinkers_to_text(struct seq_buf *out)
>>> +{
>>> +	struct shrinker *shrinker;
>>> +	struct shrinker_by_mem {
>>> +		struct shrinker	*shrinker;
>>> +		unsigned long	mem;
>>
>> The "mem" also looks a little strange, how about naming it
>> "freeable"?
> 
> The type is just used in this one function, but sure.
> 
>>
>>> +	} shrinkers_by_mem[10];
>>> +	int i, nr = 0;
>>> +
>>> +	if (!mutex_trylock(&shrinker_mutex)) {
>>> +		seq_buf_puts(out, "(couldn't take shrinker lock)");
>>> +		return;
>>> +	}
>>
>> We now have lockless method (RCU + refcount) to iterate shrinker_list,
>> please refer to shrink_slab().
> 
> Will do!
kernel test robot Nov. 24, 2023, 11:46 a.m. UTC | #3
Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.7-rc2 next-20231124]
[cannot apply to vbabka-slab/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/seq_buf-seq_buf_human_readable_u64/20231123-111937
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231122232515.177833-3-kent.overstreet%40linux.dev
patch subject: [PATCH 2/7] mm: shrinker: Add a .to_text() method for shrinkers
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20231124/202311241457.d18KlTLu-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231124/202311241457.d18KlTLu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311241457.d18KlTLu-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/shrinker.c:832: warning: Function parameter or member 'out' not described in 'shrinkers_to_text'


vim +832 mm/shrinker.c

   824	
   825	/**
   826	 * shrinkers_to_text - Report on shrinkers with highest usage
   827	 *
   828	 * This reports on the top 10 shrinkers, by object counts, in sorted order:
   829	 * intended to be used for OOM reporting.
   830	 */
   831	void shrinkers_to_text(struct seq_buf *out)
 > 832	{
Kent Overstreet Nov. 25, 2023, 12:30 a.m. UTC | #4
On Fri, Nov 24, 2023 at 11:08:11AM +0800, Qi Zheng wrote:
> Hi Kent,
> 
> On 2023/11/24 05:24, Kent Overstreet wrote:
> > On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
> > > > +	void (*to_text)(struct seq_buf *, struct shrinker *);
> > > 
> > > The "to_text" looks a little strange, how about naming it
> > > "stat_objects"?
> > 
> > The convention I've been using heavily in bcachefs is
> > typename_to_text(), or type.to_text(), for debug reports. The
> 
> OK.
> 
> > consistency is nice.
> 
> However, this is inconsistent with the name style of other
> shrinker callbacks. Please use the "objects" suffix. As for
> bcachefs's own callback function, you can use typename_to_text()
> to ensure consistency.

That would be inconsistent with introducing a convention to the wider
kernel.
Muchun Song Nov. 28, 2023, 3:27 a.m. UTC | #5
> On Nov 25, 2023, at 08:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> On Fri, Nov 24, 2023 at 11:08:11AM +0800, Qi Zheng wrote:
>> Hi Kent,
>> 
>> On 2023/11/24 05:24, Kent Overstreet wrote:
>>> On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
>>>>> + void (*to_text)(struct seq_buf *, struct shrinker *);
>>>> 
>>>> The "to_text" looks a little strange, how about naming it
>>>> "stat_objects"?
>>> 
>>> The convention I've been using heavily in bcachefs is
>>> typename_to_text(), or type.to_text(), for debug reports. The
>> 
>> OK.
>> 
>>> consistency is nice.
>> 
>> However, this is inconsistent with the name style of other
>> shrinker callbacks. Please use the "objects" suffix. As for
>> bcachefs's own callback function, you can use typename_to_text()
>> to ensure consistency.
> 
> That would be inconsistent with introducing a convention to the wider
> kernel.
> 

I do not think .to_text is a good name. I really did not know what it meant
when I first looked at this name. I know you want to report the objects of
shrinkers, so why not use .report_objects or the stat_objects proposed by Qi.
Although .to_text is only used by bcachefs now, shrinker is a general module
which does not serve only bcachefs itself. I think it would be better
to use a more straightforward name.

Thanks,
Muchun.
Kent Overstreet Nov. 28, 2023, 3:53 a.m. UTC | #6
On Tue, Nov 28, 2023 at 11:27:11AM +0800, Muchun Song wrote:
> 
> 
> > On Nov 25, 2023, at 08:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > On Fri, Nov 24, 2023 at 11:08:11AM +0800, Qi Zheng wrote:
> >> Hi Kent,
> >> 
> >> On 2023/11/24 05:24, Kent Overstreet wrote:
> >>> On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
> >>>>> + void (*to_text)(struct seq_buf *, struct shrinker *);
> >>>> 
> >>>> The "to_text" looks a little strange, how about naming it
> >>>> "stat_objects"?
> >>> 
> >>> The convention I've been using heavily in bcachefs is
> >>> typename_to_text(), or type.to_text(), for debug reports. The
> >> 
> >> OK.
> >> 
> >>> consistency is nice.
> >> 
> >> However, this is inconsistent with the name style of other
> >> shrinker callbacks. Please use the "objects" suffix. As for
> >> bcachefs's own callback function, you can use typename_to_text()
> >> to ensure consistency.
> > 
> > That would be inconsistent with introducing a convention to the wider
> > kernel.
> > 
> 
> I don not think .to_text is a good name. I really do not know what it means
> when I first look at this name. I knew you want to report the objects of
> shrinks, so why not use .report_objects or stat_objects proposed by Qi.
> Although .to_text is only used by bcachefs now, shrinker is a general module
> which is not only serving the bcachefs itself. I think it should be better
> to use a more straightforward name.

No, .report_objects or .stat_objects would be wrong; this isn't
generating a report on the objects owned by the shrinker, it's just a
report on the statistics of the shrinker itself.

That's why the convention is typename_to_text() - generate a text
representation of an object of that type.
Qi Zheng Nov. 28, 2023, 6:23 a.m. UTC | #7
On 2023/11/28 11:53, Kent Overstreet wrote:
> On Tue, Nov 28, 2023 at 11:27:11AM +0800, Muchun Song wrote:
>>
>>
>>> On Nov 25, 2023, at 08:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>>
>>> On Fri, Nov 24, 2023 at 11:08:11AM +0800, Qi Zheng wrote:
>>>> Hi Kent,
>>>>
>>>> On 2023/11/24 05:24, Kent Overstreet wrote:
>>>>> On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
>>>>>>> + void (*to_text)(struct seq_buf *, struct shrinker *);
>>>>>>
>>>>>> The "to_text" looks a little strange, how about naming it
>>>>>> "stat_objects"?
>>>>>
>>>>> The convention I've been using heavily in bcachefs is
>>>>> typename_to_text(), or type.to_text(), for debug reports. The
>>>>
>>>> OK.
>>>>
>>>>> consistency is nice.
>>>>
>>>> However, this is inconsistent with the name style of other
>>>> shrinker callbacks. Please use the "objects" suffix. As for
>>>> bcachefs's own callback function, you can use typename_to_text()
>>>> to ensure consistency.
>>>
>>> That would be inconsistent with introducing a convention to the wider
>>> kernel.
>>>
>>
>> I don not think .to_text is a good name. I really do not know what it means
>> when I first look at this name. I knew you want to report the objects of
>> shrinks, so why not use .report_objects or stat_objects proposed by Qi.
>> Although .to_text is only used by bcachefs now, shrinker is a general module
>> which is not only serving the bcachefs itself. I think it should be better
>> to use a more straightforward name.
> 
> No, .report_objects or .stat_objects would be wrong; this isn't
> generating a report on the objects owned by the shrinker, it's just a
> report on the statistics of the shrinker itself.

Now I think adding this method might not be a good idea. If we allow
shrinkers to report their own private information, OOM logs may become
cluttered. Most people only care about some general information when
troubleshooting an OOM problem, but not the private information of a
shrinker.

So I thought maybe we could add some general statistics to the shrinker,
but adding a private ".to_text" method is not necessary.

Also +CC Michal, who has been reviewing OOM related patches recently.

Thanks,
Qi

> 
> That's why the convention is typename_to_text() - generate a text
> representation of an object of that type.
Michal Hocko Nov. 28, 2023, 10:01 a.m. UTC | #8
On Wed 22-11-23 18:25:07, Kent Overstreet wrote:
[...]
> +void shrinkers_to_text(struct seq_buf *out)
> +{
> +	struct shrinker *shrinker;
> +	struct shrinker_by_mem {
> +		struct shrinker	*shrinker;
> +		unsigned long	mem;
> +	} shrinkers_by_mem[10];
> +	int i, nr = 0;
> +
> +	if (!mutex_trylock(&shrinker_mutex)) {
> +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> +		return;
> +	}
> +
> +	list_for_each_entry(shrinker, &shrinker_list, list) {
> +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };

This seems to be global reclaim specific. What about memcg reclaim?
Kent Overstreet Nov. 28, 2023, 5:48 p.m. UTC | #9
On Tue, Nov 28, 2023 at 11:01:16AM +0100, Michal Hocko wrote:
> On Wed 22-11-23 18:25:07, Kent Overstreet wrote:
> [...]
> > +void shrinkers_to_text(struct seq_buf *out)
> > +{
> > +	struct shrinker *shrinker;
> > +	struct shrinker_by_mem {
> > +		struct shrinker	*shrinker;
> > +		unsigned long	mem;
> > +	} shrinkers_by_mem[10];
> > +	int i, nr = 0;
> > +
> > +	if (!mutex_trylock(&shrinker_mutex)) {
> > +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> > +		return;
> > +	}
> > +
> > +	list_for_each_entry(shrinker, &shrinker_list, list) {
> > +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
> 
> This seems to be global reclaim specific. What about memcg reclaim?

I have no fsckin idea how memcg reclaim works - and, for that matter,
the recent lockless shrinking work seems to have neglected to write even
an iterator macro, leaving _that_ a nasty mess so I'm not touching that
either.
Roman Gushchin Nov. 29, 2023, 12:34 a.m. UTC | #10
On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> 
> 
> On 2023/11/28 11:53, Kent Overstreet wrote:
> > On Tue, Nov 28, 2023 at 11:27:11AM +0800, Muchun Song wrote:
> > > 
> > > 
> > > > On Nov 25, 2023, at 08:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > 
> > > > On Fri, Nov 24, 2023 at 11:08:11AM +0800, Qi Zheng wrote:
> > > > > Hi Kent,
> > > > > 
> > > > > On 2023/11/24 05:24, Kent Overstreet wrote:
> > > > > > On Thu, Nov 23, 2023 at 11:32:59AM +0800, Qi Zheng wrote:
> > > > > > > > + void (*to_text)(struct seq_buf *, struct shrinker *);
> > > > > > > 
> > > > > > > The "to_text" looks a little strange, how about naming it
> > > > > > > "stat_objects"?
> > > > > > 
> > > > > > The convention I've been using heavily in bcachefs is
> > > > > > typename_to_text(), or type.to_text(), for debug reports. The
> > > > > 
> > > > > OK.
> > > > > 
> > > > > > consistency is nice.
> > > > > 
> > > > > However, this is inconsistent with the name style of other
> > > > > shrinker callbacks. Please use the "objects" suffix. As for
> > > > > bcachefs's own callback function, you can use typename_to_text()
> > > > > to ensure consistency.
> > > > 
> > > > That would be inconsistent with introducing a convention to the wider
> > > > kernel.
> > > > 
> > > 
> > > I don not think .to_text is a good name. I really do not know what it means
> > > when I first look at this name. I knew you want to report the objects of
> > > shrinks, so why not use .report_objects or stat_objects proposed by Qi.
> > > Although .to_text is only used by bcachefs now, shrinker is a general module
> > > which is not only serving the bcachefs itself. I think it should be better
> > > to use a more straightforward name.
> > 
> > No, .report_objects or .stat_objects would be wrong; this isn't
> > generating a report on the objects owned by the shrinker, it's just a
> > report on the statistics of the shrinker itself.
> 
> Now I think adding this method might not be a good idea. If we allow
> shrinkers to report thier own private information, OOM logs may become
> cluttered. Most people only care about some general information when
> troubleshooting OOM problem, but not the private information of a
> shrinker.

I agree with that.

It seems that the feature is mostly useful for kernel developers and it's easily
achievable by attaching a bpf program to the oom handler. If it requires a bit
of work on the bpf side, we can do that instead, but probably not. And this
solution can potentially provide way more information in a more flexible way.

So I'm not convinced it's a good idea to make the generic oom handling code
more complicated and fragile for everybody, as well as making oom reports differ
more between kernel versions and configurations.

Thanks!
Michal Hocko Nov. 29, 2023, 9:14 a.m. UTC | #11
On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
[...]
> > Now I think adding this method might not be a good idea. If we allow
> > shrinkers to report thier own private information, OOM logs may become
> > cluttered. Most people only care about some general information when
> > troubleshooting OOM problem, but not the private information of a
> > shrinker.
> 
> I agree with that.
> 
> It seems that the feature is mostly useful for kernel developers and it's easily
> achievable by attaching a bpf program to the oom handler. If it requires a bit
> of work on the bpf side, we can do that instead, but probably not. And this
> solution can potentially provide way more information in a more flexible way.
> 
> So I'm not convinced it's a good idea to make the generic oom handling code
> more complicated and fragile for everybody, as well as making oom reports differ
> more between kernel versions and configurations.

Completely agreed! From my many years of experience analysing oom reports
from production systems I would conclude the following categories:
	- clear runaways (and/or memory leaks)
		- userspace consumers - either shmem or anonymous memory
		  predominantly consumes the memory, swap is either depleted
		  or not configured.
		  The OOM report is usually useful to pinpoint those as we
		  have the required counters available.
		- kernel memory consumers - if we are lucky they are
		  using the slab allocator and unreclaimable slab is a huge
		  part of the memory consumption. If this is a page
		  allocator user the oom report only helps to deduce
		  the fact by looking at how much memory user + slab + page
		  tables etc. add up to. But identifying the root cause is
		  close to impossible without something like page_owner
		  or a crash dump.
	- misbehaving memory reclaim
		- minority of issues and the oom report is usually
		  insufficient to drill down to the root cause. If the
		  problem is reproducible then collecting vmstat data
		  can give a much better clue.
		- a high number of reclaimable slab objects or plenty of free swap
		  are good indicators. Shrinker data could be
		  potentially helpful in the slab case but I really have
		  a hard time remembering any such situation.
On non-production systems the situation is quite different. I can see
how it could be very beneficial to add very specific debugging data
for a subsystem/shrinker which is under development and could cause the OOM. For
that purpose the proposed scheme is rather inflexible AFAICS.
Michal Hocko Nov. 29, 2023, 4:02 p.m. UTC | #12
On Tue 28-11-23 12:48:53, Kent Overstreet wrote:
> On Tue, Nov 28, 2023 at 11:01:16AM +0100, Michal Hocko wrote:
> > On Wed 22-11-23 18:25:07, Kent Overstreet wrote:
> > [...]
> > > +void shrinkers_to_text(struct seq_buf *out)
> > > +{
> > > +	struct shrinker *shrinker;
> > > +	struct shrinker_by_mem {
> > > +		struct shrinker	*shrinker;
> > > +		unsigned long	mem;
> > > +	} shrinkers_by_mem[10];
> > > +	int i, nr = 0;
> > > +
> > > +	if (!mutex_trylock(&shrinker_mutex)) {
> > > +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> > > +		return;
> > > +	}
> > > +
> > > +	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
> > 
> > This seems to be global reclaim specific. What about memcg reclaim?
> 
> I have no fsckin idea how memcg reclaim works - and, for that matter,
> the recent lockless shrinking work seems to have neglected to write even
> an iterator macro, leaving _that_ a nasty mess so I'm not touching that
> either.

OK, you could have made it clearer that all of this is aimed at
the global OOM handling, with an outlook on what should be done if this
is ever required.

Another thing you want to look into is a NUMA-constrained OOM (mbind,
cpuset), where this output could be actively misleading.
Kent Overstreet Nov. 29, 2023, 10:36 p.m. UTC | #13
On Wed, Nov 29, 2023 at 05:02:35PM +0100, Michal Hocko wrote:
> On Tue 28-11-23 12:48:53, Kent Overstreet wrote:
> > On Tue, Nov 28, 2023 at 11:01:16AM +0100, Michal Hocko wrote:
> > > On Wed 22-11-23 18:25:07, Kent Overstreet wrote:
> > > [...]
> > > > +void shrinkers_to_text(struct seq_buf *out)
> > > > +{
> > > > +	struct shrinker *shrinker;
> > > > +	struct shrinker_by_mem {
> > > > +		struct shrinker	*shrinker;
> > > > +		unsigned long	mem;
> > > > +	} shrinkers_by_mem[10];
> > > > +	int i, nr = 0;
> > > > +
> > > > +	if (!mutex_trylock(&shrinker_mutex)) {
> > > > +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > > +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
> > > 
> > > This seems to be global reclaim specific. What about memcg reclaim?
> > 
> > I have no fsckin idea how memcg reclaim works - and, for that matter,
> > the recent lockless shrinking work seems to have neglected to write even
> > an iterator macro, leaving _that_ a nasty mess so I'm not touching that
> > either.
> 
> OK, you could have made it more clearly that all of this is aiming at
> the global OOM handling. With an outlook on what should be done if this
> was ever required.
> 
> Another thing you want to look into is a NUMA constrained OOM (mbind,
> cpuset) where this output could be actively misleading.

Yeah.

It's not clear to me how (if at all) we want memcg to be represented in
this output; it's not something we filesystem developers think about a
lot. NUMA definitely should, though.
Kent Overstreet Nov. 29, 2023, 11:11 p.m. UTC | #14
On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
> On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> > On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> [...]
> > > Now I think adding this method might not be a good idea. If we allow
> > > shrinkers to report thier own private information, OOM logs may become
> > > cluttered. Most people only care about some general information when
> > > troubleshooting OOM problem, but not the private information of a
> > > shrinker.
> > 
> > I agree with that.
> > 
> > It seems that the feature is mostly useful for kernel developers and it's easily
> > achievable by attaching a bpf program to the oom handler. If it requires a bit
> > of work on the bpf side, we can do that instead, but probably not. And this
> > solution can potentially provide way more information in a more flexible way.
> > 
> > So I'm not convinced it's a good idea to make the generic oom handling code
> > more complicated and fragile for everybody, as well as making oom reports differ
> > more between kernel versions and configurations.
> 
> Completely agreed! From my many years of experience of oom reports
> analysing from production systems I would conclude the following categories
> 	- clear runaways (and/or memory leaks)
> 		- userspace consumers - either shmem or anonymous memory
> 		  predominantly consumes the memory, swap is either depleted
> 		  or not configured.
> 		  OOM report is usually useful to pinpoint those as we
> 		  have required counters available
> 		- kernel memory consumers - if we are lucky they are
> 		  using slab allocator and unreclaimable slab is a huge
> 		  part of the memory consumption. If this is a page
> 		  allocator user the oom repport only helps to deduce
> 		  the fact by looking at how much user + slab + page
> 		  table etc. form. But identifying the root cause is
> 		  close to impossible without something like page_owner
> 		  or a crash dump.
> 	- misbehaving memory reclaim
> 		- minority of issues and the oom report is usually
> 		  insufficient to drill down to the root cause. If the
> 		  problem is reproducible then collecting vmstat data
> 		  can give a much better clue.
> 		- high number of slab reclaimable objects or free swap
> 		  are good indicators. Shrinkers data could be
> 		  potentially helpful in the slab case but I really have
> 		  hard time to remember any such situation.
> On non-production systems the situation is quite different. I can see
> how it could be very beneficial to add a very specific debugging data
> for subsystem/shrinker which is developed and could cause the OOM. For
> that purpose the proposed scheme is rather inflexible AFAICS.

Considering that you're an MM guy, and that shrinkers are pretty much
universally used by _filesystem_ people - I'm not sure your experience
is the most relevant here?

The general attitude I've been seeing in this thread has been one of
dismissiveness towards filesystem people. Roman too; back when he was
working on his shrinker debug feature I reached out to him, explained
that I was working on my own, and asked about collaborating - got
crickets in response...

Hmm..

Besides that, I haven't seen anything whatsoever out of you guys to
make our lives easier, regarding OOM debugging, nor do you guys even
seem interested in the needs and perspectives of the filesystem people.
Roman, your feature didn't help one bit for OOM debugging - didn't even
come with documentation or hints as to what it's for.

BPF? Please.

Anyways.

Regarding log spam: that's something this patchset already starts to
address. I don't think we needed to be dumping every single slab in the
system, for ~2 pages worth of logs; hence this patchset changes that to
just print the top 10.

The same approach is taken with shrinkers: more targeted, less spammy
output.

So now that that concern has been addressed, perhaps some actual meat:

For one, the patchset adds tracking for when a shrinker was last asked
to free something, vs. when it was actually freed. So right there, we
can finally see at a glance when a shrinker has gotten stuck and which
one.

Next up, why has a shrinker gotten stuck?

That's why the .to_text() callback is needed: _shrinkers have internal
state_, and the reasons objects may not be getting freed are specific to
a given shrinker implementation. In bcachefs we added counters for each
individual reason an object may be skipped by the shrinker (io in
flight? that debugged a runaway prefetch issue. too many dirty? that
points to journal reclaim).

I'm working on a .to_text() function for the struct super_block
shrinker, will post that one soon...
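
(As a sketch of what such a super_block .to_text() could report - this is
not the posted patch, just an outline using the existing per-sb LRU counts;
whether private_data points at the super_block depends on how the sb
shrinker is allocated in the tree in question:)

	static void super_cache_to_text(struct seq_buf *out, struct shrinker *shrink)
	{
		struct super_block *sb = shrink->private_data;

		seq_buf_printf(out, "%s (%s)\n", sb->s_type->name, sb->s_id);
		seq_buf_printf(out, "dentries:\t%lu\n", list_lru_count(&sb->s_dentry_lru));
		seq_buf_printf(out, "inodes:\t%lu\n", list_lru_count(&sb->s_inode_lru));
	}
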
Qi Zheng Nov. 30, 2023, 3:09 a.m. UTC | #15
On 2023/11/30 07:11, Kent Overstreet wrote:
> On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
>> On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
>>> On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
>> [...]
>>>> Now I think adding this method might not be a good idea. If we allow
>>>> shrinkers to report thier own private information, OOM logs may become
>>>> cluttered. Most people only care about some general information when
>>>> troubleshooting OOM problem, but not the private information of a
>>>> shrinker.
>>>
>>> I agree with that.
>>>
>>> It seems that the feature is mostly useful for kernel developers and it's easily
>>> achievable by attaching a bpf program to the oom handler. If it requires a bit
>>> of work on the bpf side, we can do that instead, but probably not. And this
>>> solution can potentially provide way more information in a more flexible way.
>>>
>>> So I'm not convinced it's a good idea to make the generic oom handling code
>>> more complicated and fragile for everybody, as well as making oom reports differ
>>> more between kernel versions and configurations.
>>
>> Completely agreed! From my many years of experience of oom reports
>> analysing from production systems I would conclude the following categories
>> 	- clear runaways (and/or memory leaks)
>> 		- userspace consumers - either shmem or anonymous memory
>> 		  predominantly consumes the memory, swap is either depleted
>> 		  or not configured.
>> 		  OOM report is usually useful to pinpoint those as we
>> 		  have required counters available
>> 		- kernel memory consumers - if we are lucky they are
>> 		  using slab allocator and unreclaimable slab is a huge
>> 		  part of the memory consumption. If this is a page
>> 		  allocator user the oom repport only helps to deduce
>> 		  the fact by looking at how much user + slab + page
>> 		  table etc. form. But identifying the root cause is
>> 		  close to impossible without something like page_owner
>> 		  or a crash dump.
>> 	- misbehaving memory reclaim
>> 		- minority of issues and the oom report is usually
>> 		  insufficient to drill down to the root cause. If the
>> 		  problem is reproducible then collecting vmstat data
>> 		  can give a much better clue.
>> 		- high number of slab reclaimable objects or free swap
>> 		  are good indicators. Shrinkers data could be
>> 		  potentially helpful in the slab case but I really have
>> 		  hard time to remember any such situation.
>> On non-production systems the situation is quite different. I can see
>> how it could be very beneficial to add a very specific debugging data
>> for subsystem/shrinker which is developed and could cause the OOM. For
>> that purpose the proposed scheme is rather inflexible AFAICS.
> 
> Considering that you're an MM guy, and that shrinkers are pretty much
> universally used by _filesystem_ people - I'm not sure your experience
> is the most relevant here?
> 
> The general attitude I've been seeing in this thread has been one of
> dismissiveness towards filesystem people. Roman too; back when he was

Oh, please don't say that, it seems like you are the only one causing
the fight. We deeply respect the opinions of file system developers, so
I invited Dave to this thread from the beginning. And you didn’t CC
linux-fsdevel@vger.kernel.org yourself.

> working on his shrinker debug feature I reached out to him, explained
> that I was working on my own, and asked about collaborating - got
> crickets in response...
> 
> Hmm..
> 
> Besides that, I haven't seen anything what-so-ever out of you guys to
> make our lives easier, regarding OOM debugging, nor do you guys even
> seem interested in the needs and perspectives of the filesytem people.
> Roman, your feature didn't help one bit for OOM debuging - didn't even
> come with documentation or hints as to what it's for.
> 
> BPF? Please.

(Disclaimer, no intention to start a fight, here are some objective
views.)

Why not? In addition to printk, there are many good debugging tools
worth trying, such as BPF related tools, drgn, etc.

For non-bcachefs developers, who knows what those statistics mean?

You can use BPF or drgn to traverse in advance to get the address of the
bcachefs shrinker structure, and then during OOM, find the bcachefs
private structure through the shrinker->private_data member, and then
dump the bcachefs private data. Is there any problem with this?

Thanks,
Qi

> 
> Anyways.
> 
> Regarding log spam: that's something this patchset already starts to
> address. I don't think we needed to be dumping every single slab in the
> system, for ~2 pages worth of logs; hence this patchset changes that to
> just print the top 10.
> 
> The same approach is taken with shrinkers: more targeted, less spammy
> output.
> 
> So now that that concern has been addressed, perhaps some actual meat:
> 
> For one, the patchset adds tracking for when a shrinker was last asked
> to free something, vs. when it was actually freed. So right there, we
> can finally see at a glance when a shrinker has gotten stuck and which
> one.
> 
> Next up, why has a shrinker gotten stuck?
> 
> That's why the .to_text() callback is needed: _shrinkers have internal
> state_, and the reasons objects may not be getting freed are specific to
> a given shrinker implementation. In bcachefs we added counters for each
> individual reason an object may be skipped by the shrinker (io in
> flight? that debugged a runaway prefetch issue. too many dirty? that
> points to journal reclaim).
> 
> I'm working on a .to_text() function for the struct super_block
> shrinker, will post that one soon...
Kent Overstreet Nov. 30, 2023, 3:21 a.m. UTC | #16
On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> 
> 
> On 2023/11/30 07:11, Kent Overstreet wrote:
> > On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
> > > On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> > > > On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> > > [...]
> > > > > Now I think adding this method might not be a good idea. If we allow
> > > > > shrinkers to report thier own private information, OOM logs may become
> > > > > cluttered. Most people only care about some general information when
> > > > > troubleshooting OOM problem, but not the private information of a
> > > > > shrinker.
> > > > 
> > > > I agree with that.
> > > > 
> > > > It seems that the feature is mostly useful for kernel developers and it's easily
> > > > achievable by attaching a bpf program to the oom handler. If it requires a bit
> > > > of work on the bpf side, we can do that instead, but probably not. And this
> > > > solution can potentially provide way more information in a more flexible way.
> > > > 
> > > > So I'm not convinced it's a good idea to make the generic oom handling code
> > > > more complicated and fragile for everybody, as well as making oom reports differ
> > > > more between kernel versions and configurations.
> > > 
> > > Completely agreed! From my many years of experience of oom reports
> > > analysing from production systems I would conclude the following categories
> > > 	- clear runaways (and/or memory leaks)
> > > 		- userspace consumers - either shmem or anonymous memory
> > > 		  predominantly consumes the memory, swap is either depleted
> > > 		  or not configured.
> > > 		  OOM report is usually useful to pinpoint those as we
> > > 		  have required counters available
> > > 		- kernel memory consumers - if we are lucky they are
> > > 		  using slab allocator and unreclaimable slab is a huge
> > > 		  part of the memory consumption. If this is a page
> > > 		  allocator user the oom repport only helps to deduce
> > > 		  the fact by looking at how much user + slab + page
> > > 		  table etc. form. But identifying the root cause is
> > > 		  close to impossible without something like page_owner
> > > 		  or a crash dump.
> > > 	- misbehaving memory reclaim
> > > 		- minority of issues and the oom report is usually
> > > 		  insufficient to drill down to the root cause. If the
> > > 		  problem is reproducible then collecting vmstat data
> > > 		  can give a much better clue.
> > > 		- high number of slab reclaimable objects or free swap
> > > 		  are good indicators. Shrinkers data could be
> > > 		  potentially helpful in the slab case but I really have
> > > 		  hard time to remember any such situation.
> > > On non-production systems the situation is quite different. I can see
> > > how it could be very beneficial to add a very specific debugging data
> > > for subsystem/shrinker which is developed and could cause the OOM. For
> > > that purpose the proposed scheme is rather inflexible AFAICS.
> > 
> > Considering that you're an MM guy, and that shrinkers are pretty much
> > universally used by _filesystem_ people - I'm not sure your experience
> > is the most relevant here?
> > 
> > The general attitude I've been seeing in this thread has been one of
> > dismissiveness towards filesystem people. Roman too; back when he was
> 
> Oh, please don't say that, it seems like you are the only one causing
> the fight. We deeply respect the opinions of file system developers, so
> I invited Dave to this thread from the beginning. And you didn’t CC
> linux-fsdevel@vger.kernel.org yourself.
> 
> > working on his shrinker debug feature I reached out to him, explained
> > that I was working on my own, and asked about collaborating - got
> > crickets in response...
> > 
> > Hmm..
> > 
> > Besides that, I haven't seen anything what-so-ever out of you guys to
> > make our lives easier, regarding OOM debugging, nor do you guys even
> > seem interested in the needs and perspectives of the filesytem people.
> > Roman, your feature didn't help one bit for OOM debuging - didn't even
> > come with documentation or hints as to what it's for.
> > 
> > BPF? Please.
> 
> (Disclaimer, no intention to start a fight, here are some objective
> views.)
> 
> Why not? In addition to printk, there are many good debugging tools
> worth trying, such as BPF related tools, drgn, etc.
> 
> For non-bcachefs developers, who knows what those statistics mean?
> 
> You can use BPF or drgn to traverse in advance to get the address of the
> bcachefs shrinker structure, and then during OOM, find the bcachefs
> private structure through the shrinker->private_data member, and then
> dump the bcachefs private data. Is there any problem with this?

No, BPF is not an excuse for not improving our OOM/allocation failure
reports. BPF/tracing are secondary tools; whenever we're logging
information about a problem we should strive to log enough information
to debug the issue.

We've got junk in there we don't need: as mentioned before, there's no
need to be dumping information on _every_ slab, we can pick the ones
using the most memory and show those.

Similarly for shrinkers, we're not going to be printing all of them -
the patchset picks the top 10 by objects and prints those. Could
probably be ~4, there are fewer shrinkers than slabs; also if we can get
shrinkers to report on memory owned in bytes, that will help too with
deciding what information is pertinent.

That's not a huge amount of information to be dumping, and it makes it
easier to debug something that has historically been a major pain point.

There's a lot more that could be done to make our OOM reports more
readable and useful to non-mm developers. Unfortunately, any time someone
changes the show_mem report the immediate reaction seems to be "but
that will break my log parsing/change what I'm used to!"...
Qi Zheng Nov. 30, 2023, 3:42 a.m. UTC | #17
On 2023/11/30 11:21, Kent Overstreet wrote:
> On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/11/30 07:11, Kent Overstreet wrote:
>>> On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
>>>> On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
>>>>> On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
>>>> [...]
>>>>>> Now I think adding this method might not be a good idea. If we allow
>>>>>> shrinkers to report thier own private information, OOM logs may become
>>>>>> cluttered. Most people only care about some general information when
>>>>>> troubleshooting OOM problem, but not the private information of a
>>>>>> shrinker.
>>>>>
>>>>> I agree with that.
>>>>>
>>>>> It seems that the feature is mostly useful for kernel developers and it's easily
>>>>> achievable by attaching a bpf program to the oom handler. If it requires a bit
>>>>> of work on the bpf side, we can do that instead, but probably not. And this
>>>>> solution can potentially provide way more information in a more flexible way.
>>>>>
>>>>> So I'm not convinced it's a good idea to make the generic oom handling code
>>>>> more complicated and fragile for everybody, as well as making oom reports differ
>>>>> more between kernel versions and configurations.
>>>>
>>>> Completely agreed! From my many years of experience of oom reports
>>>> analysing from production systems I would conclude the following categories
>>>> 	- clear runaways (and/or memory leaks)
>>>> 		- userspace consumers - either shmem or anonymous memory
>>>> 		  predominantly consumes the memory, swap is either depleted
>>>> 		  or not configured.
>>>> 		  OOM report is usually useful to pinpoint those as we
>>>> 		  have required counters available
>>>> 		- kernel memory consumers - if we are lucky they are
>>>> 		  using slab allocator and unreclaimable slab is a huge
>>>> 		  part of the memory consumption. If this is a page
>>>> 		  allocator user the oom repport only helps to deduce
>>>> 		  the fact by looking at how much user + slab + page
>>>> 		  table etc. form. But identifying the root cause is
>>>> 		  close to impossible without something like page_owner
>>>> 		  or a crash dump.
>>>> 	- misbehaving memory reclaim
>>>> 		- minority of issues and the oom report is usually
>>>> 		  insufficient to drill down to the root cause. If the
>>>> 		  problem is reproducible then collecting vmstat data
>>>> 		  can give a much better clue.
>>>> 		- high number of slab reclaimable objects or free swap
>>>> 		  are good indicators. Shrinkers data could be
>>>> 		  potentially helpful in the slab case but I really have
>>>> 		  hard time to remember any such situation.
>>>> On non-production systems the situation is quite different. I can see
>>>> how it could be very beneficial to add a very specific debugging data
>>>> for subsystem/shrinker which is developed and could cause the OOM. For
>>>> that purpose the proposed scheme is rather inflexible AFAICS.
>>>
>>> Considering that you're an MM guy, and that shrinkers are pretty much
>>> universally used by _filesystem_ people - I'm not sure your experience
>>> is the most relevant here?
>>>
>>> The general attitude I've been seeing in this thread has been one of
>>> dismissiveness towards filesystem people. Roman too; back when he was
>>
>> Oh, please don't say that, it seems like you are the only one causing
>> the fight. We deeply respect the opinions of file system developers, so
>> I invited Dave to this thread from the beginning. And you didn’t CC
>> linux-fsdevel@vger.kernel.org yourself.
>>
>>> working on his shrinker debug feature I reached out to him, explained
>>> that I was working on my own, and asked about collaborating - got
>>> crickets in response...
>>>
>>> Hmm..
>>>
>>> Besides that, I haven't seen anything what-so-ever out of you guys to
>>> make our lives easier, regarding OOM debugging, nor do you guys even
>>> seem interested in the needs and perspectives of the filesytem people.
>>> Roman, your feature didn't help one bit for OOM debuging - didn't even
>>> come with documentation or hints as to what it's for.
>>>
>>> BPF? Please.
>>
>> (Disclaimer, no intention to start a fight, here are some objective
>> views.)
>>
>> Why not? In addition to printk, there are many good debugging tools
>> worth trying, such as BPF related tools, drgn, etc.
>>
>> For non-bcachefs developers, who knows what those statistics mean?
>>
>> You can use BPF or drgn to traverse in advance to get the address of the
>> bcachefs shrinker structure, and then during OOM, find the bcachefs
>> private structure through the shrinker->private_data member, and then
>> dump the bcachefs private data. Is there any problem with this?
> 
> No, BPF is not an excuse for improving our OOM/allocation failure
> reports. BPF/tracing are secondary tools; whenever we're logging
> information about a problem we should strive to log enough information
> to debug the issue.
> 
> We've got junk in there we don't need: as mentioned before, there's no
> need to be dumping information on _every_ slab, we can pick the ones
> using the most memory and show those.
> 
> Similarly for shrinkers, we're not going to be printing all of them -
> the patchset picks the top 10 by objects and prints those. Could
> probably be ~4, there's fewer shrinkers than slabs; also if we can get
> shrinkers to report on memory owned in bytes, that will help too with
> deciding what information is pertinent.

I'm not worried about the shrinker's general data. What I'm worried
about is the shrinker's private data. Except for the corresponding
developers, others don't know the meaning of the private statistical
data, and we have no control over how much private data is printed or in
what form. This may well make OOM logs confusing and impossible to parse
automatically. Any thoughts on this?

> 
> That's not a huge amount of information to be dumping, and to make it
> easier to debug something that has historically been a major pain point.
> 
> There's a lot more that could be done to make our OOM reports more
> readable and useful to non-mm developers. Unfortunately, any time
> changing the show_mem report the immediate reaction seems to be "but
> that will break my log parsing/change what I'm used to!"...
Kent Overstreet Nov. 30, 2023, 4:14 a.m. UTC | #18
On Thu, Nov 30, 2023 at 11:42:45AM +0800, Qi Zheng wrote:
> > Similarly for shrinkers, we're not going to be printing all of them -
> > the patchset picks the top 10 by objects and prints those. Could
> > probably be ~4, there's fewer shrinkers than slabs; also if we can get
> > shrinkers to report on memory owned in bytes, that will help too with
> > deciding what information is pertinent.
> 
> I'm not worried about the shrinker's general data. What I'm worried
> about is the shrinker's private data. Except for the corresponding
> developers, others don't know the meaning of the private statistical
> data, and we have no control over the printing quantity and form of
> the private data. This may indeed cause OOM log confusion and failure
> to automatically parse. For this, any thoughts?

If a shrinker is responsible for the OOM, then that private state is
exactly what is needed to debug the issue.

I explained this earlier - shrinkers can skip reclaiming objects for a
variety of reasons; in bcachefs's case that could be trylock() failure,
an IO in flight, the node being dirty, and more. Unlike the system inode
shrinker, it's much too expensive to move objects on and off the
shrinker list whenever they're touched. Again, this all comes from real
world experience.

The show_mem report is already full of numbers with zero explanation of
how they're relevant for debugging OOMs; we really need to improve how
that is presented as well.
Michal Hocko Nov. 30, 2023, 8:14 a.m. UTC | #19
On Wed 29-11-23 18:11:47, Kent Overstreet wrote:
> On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
> > On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> > > On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> > [...]
> > > > Now I think adding this method might not be a good idea. If we allow
> > > > shrinkers to report thier own private information, OOM logs may become
> > > > cluttered. Most people only care about some general information when
> > > > troubleshooting OOM problem, but not the private information of a
> > > > shrinker.
> > > 
> > > I agree with that.
> > > 
> > > It seems that the feature is mostly useful for kernel developers and it's easily
> > > achievable by attaching a bpf program to the oom handler. If it requires a bit
> > > of work on the bpf side, we can do that instead, but probably not. And this
> > > solution can potentially provide way more information in a more flexible way.
> > > 
> > > So I'm not convinced it's a good idea to make the generic oom handling code
> > > more complicated and fragile for everybody, as well as making oom reports differ
> > > more between kernel versions and configurations.
> > 
> > Completely agreed! From my many years of experience of oom reports
> > analysing from production systems I would conclude the following categories
> > 	- clear runaways (and/or memory leaks)
> > 		- userspace consumers - either shmem or anonymous memory
> > 		  predominantly consumes the memory, swap is either depleted
> > 		  or not configured.
> > 		  OOM report is usually useful to pinpoint those as we
> > 		  have required counters available
> > 		- kernel memory consumers - if we are lucky they are
> > 		  using slab allocator and unreclaimable slab is a huge
> > 		  part of the memory consumption. If this is a page
> > 		  allocator user the oom repport only helps to deduce
> > 		  the fact by looking at how much user + slab + page
> > 		  table etc. form. But identifying the root cause is
> > 		  close to impossible without something like page_owner
> > 		  or a crash dump.
> > 	- misbehaving memory reclaim
> > 		- minority of issues and the oom report is usually
> > 		  insufficient to drill down to the root cause. If the
> > 		  problem is reproducible then collecting vmstat data
> > 		  can give a much better clue.
> > 		- high number of slab reclaimable objects or free swap
> > 		  are good indicators. Shrinkers data could be
> > 		  potentially helpful in the slab case but I really have
> > 		  hard time to remember any such situation.
> > On non-production systems the situation is quite different. I can see
> > how it could be very beneficial to add a very specific debugging data
> > for subsystem/shrinker which is developed and could cause the OOM. For
> > that purpose the proposed scheme is rather inflexible AFAICS.
> 
> Considering that you're an MM guy, and that shrinkers are pretty much
> universally used by _filesystem_ people - I'm not sure your experience
> is the most relevant here?

I really do not understand how you have concluded that. In those years
of analysis I was not debugging my _own_ code. I was dealing with
customer reports and I would not really blame them for specifically
triggering any class of OOM reports.
 
> The general attitude I've been seeing in this thread has been one of
> dismissiveness towards filesystem people. Roman too; back when he was
> working on his shrinker debug feature I reached out to him, explained
> that I was working on my own, and asked about collaborating - got
> crickets in response...

This is completely off and it makes me _really_ wonder whether
discussions like this one are really worth the time. You have been presented
with arguments, yet you seem to be convinced that every disagreement is against
you. This is not the first time this has happened. Stop it, please!

As a matter of fact, you are proposing a very specific form of
debugging without showing that this is a generally useful thing to do or
even giving us a couple of examples where it was useful in a production
environment. This is where you should have started, and then we could
have helped form an acceptable solution. Throwing out a "this does what we
need, take it or leave it" attitude is usually not the best way to get your
work merged.
 
> Hmm..
> 
> Besides that, I haven't seen anything what-so-ever out of you guys to
> make our lives easier, regarding OOM debugging, nor do you guys even
> seem interested in the needs and perspectives of the filesytem people.
> Roman, your feature didn't help one bit for OOM debuging - didn't even
> come with documentation or hints as to what it's for.
> 
> BPF? Please.
> 
> Anyways.
> 
> Regarding log spam: that's something this patchset already starts to
> address. I don't think we needed to be dumping every single slab in the
> system, for ~2 pages worth of logs; hence this patchset changes that to
> just print the top 10.

Increasing the threshold for slabs to be printed is something I wouldn't
mind at all.
 
> The same approach is taken with shrinkers: more targeted, less spammy
> output.
> 
> So now that that concern has been addressed, perhaps some actual meat:
> 
> For one, the patchset adds tracking for when a shrinker was last asked
> to free something, vs. when it was actually freed. So right there, we
> can finally see at a glance when a shrinker has gotten stuck and which
> one.

The primary problem I have with this is how to decide whether to dump
shrinker data and/or which shrinkers to mention. How do you know that it
is the specific shrinker which has contributed to the OOM state?
Printing that data unconditionally will very likely be just additional
ballast in most production situations. Sure, if you are doing filesystem
development and you are tuning your specific shrinker then this might be
really important information to have. But then it is a development debugging
tool rather than something we want or need to have in a generic oom
report.

All that being said, I am with you on the fact that the oom report in
its current form could see improvements. But when adding more
information, please always focus on general usefulness. We have very
rich tracing capabilities which could be used for ad-hoc or very
specific purposes as they are much more flexible.
Roman Gushchin Nov. 30, 2023, 7:01 p.m. UTC | #20
On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > 
> > 
> > On 2023/11/30 07:11, Kent Overstreet wrote:
> > > On Wed, Nov 29, 2023 at 10:14:54AM +0100, Michal Hocko wrote:
> > > > On Tue 28-11-23 16:34:35, Roman Gushchin wrote:
> > > > > On Tue, Nov 28, 2023 at 02:23:36PM +0800, Qi Zheng wrote:
> > > > [...]
> > > > > > Now I think adding this method might not be a good idea. If we allow
> > > > > > shrinkers to report thier own private information, OOM logs may become
> > > > > > cluttered. Most people only care about some general information when
> > > > > > troubleshooting OOM problem, but not the private information of a
> > > > > > shrinker.
> > > > > 
> > > > > I agree with that.
> > > > > 
> > > > > It seems that the feature is mostly useful for kernel developers and it's easily
> > > > > achievable by attaching a bpf program to the oom handler. If it requires a bit
> > > > > of work on the bpf side, we can do that instead, but probably not. And this
> > > > > solution can potentially provide way more information in a more flexible way.
> > > > > 
> > > > > So I'm not convinced it's a good idea to make the generic oom handling code
> > > > > more complicated and fragile for everybody, as well as making oom reports differ
> > > > > more between kernel versions and configurations.
> > > > 
> > > > Completely agreed! From my many years of experience of oom reports
> > > > analysing from production systems I would conclude the following categories
> > > > 	- clear runaways (and/or memory leaks)
> > > > 		- userspace consumers - either shmem or anonymous memory
> > > > 		  predominantly consumes the memory, swap is either depleted
> > > > 		  or not configured.
> > > > 		  OOM report is usually useful to pinpoint those as we
> > > > 		  have required counters available
> > > > 		- kernel memory consumers - if we are lucky they are
> > > > 		  using slab allocator and unreclaimable slab is a huge
> > > > 		  part of the memory consumption. If this is a page
> > > > 		  allocator user the oom repport only helps to deduce
> > > > 		  the fact by looking at how much user + slab + page
> > > > 		  table etc. form. But identifying the root cause is
> > > > 		  close to impossible without something like page_owner
> > > > 		  or a crash dump.
> > > > 	- misbehaving memory reclaim
> > > > 		- minority of issues and the oom report is usually
> > > > 		  insufficient to drill down to the root cause. If the
> > > > 		  problem is reproducible then collecting vmstat data
> > > > 		  can give a much better clue.
> > > > 		- high number of slab reclaimable objects or free swap
> > > > 		  are good indicators. Shrinkers data could be
> > > > 		  potentially helpful in the slab case but I really have
> > > > 		  hard time to remember any such situation.
> > > > On non-production systems the situation is quite different. I can see
> > > > how it could be very beneficial to add a very specific debugging data
> > > > for subsystem/shrinker which is developed and could cause the OOM. For
> > > > that purpose the proposed scheme is rather inflexible AFAICS.
> > > 
> > > Considering that you're an MM guy, and that shrinkers are pretty much
> > > universally used by _filesystem_ people - I'm not sure your experience
> > > is the most relevant here?
> > > 
> > > The general attitude I've been seeing in this thread has been one of
> > > dismissiveness towards filesystem people. Roman too; back when he was
> > 
> > Oh, please don't say that, it seems like you are the only one causing
> > the fight. We deeply respect the opinions of file system developers, so
> > I invited Dave to this thread from the beginning. And you didn't CC
> > linux-fsdevel@vger.kernel.org yourself.
> > 
> > > working on his shrinker debug feature I reached out to him, explained
> > > that I was working on my own, and asked about collaborating - got
> > > crickets in response...
> > > 
> > > Hmm..
> > > 
> > > Besides that, I haven't seen anything whatsoever out of you guys to
> > > make our lives easier, regarding OOM debugging, nor do you guys even
> > > seem interested in the needs and perspectives of the filesystem people.
> > > Roman, your feature didn't help one bit for OOM debugging - didn't even
> > > come with documentation or hints as to what it's for.
> > > 
> > > BPF? Please.
> > 
> > (Disclaimer, no intention to start a fight, here are some objective
> > views.)
> > 
> > Why not? In addition to printk, there are many good debugging tools
> > worth trying, such as BPF related tools, drgn, etc.
> > 
> > For non-bcachefs developers, who knows what those statistics mean?
> > 
> > You can use BPF or drgn to traverse in advance to get the address of the
> > bcachefs shrinker structure, and then during OOM, find the bcachefs
> > private structure through the shrinker->private_data member, and then
> > dump the bcachefs private data. Is there any problem with this?
> 
> No, BPF is not an excuse for improving our OOM/allocation failure
> reports. BPF/tracing are secondary tools; whenever we're logging
> information about a problem we should strive to log enough information
> to debug the issue.

Ok, a simple question then:
why can't you dump /proc/slabinfo after the OOM?

Unlike anon memory, slab memory (fs caches in particular) should not be heavily
affected by killing some userspace task.

Thanks.
Kent Overstreet Dec. 1, 2023, midnight UTC | #21
On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> Ok, a simple question then:
> why can't you dump /proc/slabinfo after the OOM?
> 
> Unlike anon memory, slab memory (fs caches in particular) should not be heavily
> affected by killing some userspace task.

Well, currently the show_mem report dumps slab info if unreclaimable
slab usage is over some threshold (50%?). So it already does what you
describe - sometimes.

One of the patches in this series trims that down to make it more
useful; reporting on only the top 10 slabs, by memory usage, in sorted
order and with human readable units.
Dave Chinner Dec. 1, 2023, 1:18 a.m. UTC | #22
On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > For non-bcachefs developers, who knows what those statistics mean?

For non-mm developers, who knows what those internal mm state
statistics mean?

IOWs, a non-mm developer goes and asks a mm developer to
help them decipher the output to determine what to do next.

So why can't a mm developer go and ask a subsystem developer to tell
them what the shrinker oom-kill output means?

Such a question is a demonstration of an unconscious bias
that prioritises internal mm stuff as far more important than what
anyone else outside core-mm might ever need...

> > > You can use BPF or drgn to traverse in advance to get the address of the
> > > bcachefs shrinker structure, and then during OOM, find the bcachefs
> > > private structure through the shrinker->private_data member, and then
> > > dump the bcachefs private data. Is there any problem with this?
> > 
> > No, BPF is not an excuse for improving our OOM/allocation failure
> > reports. BPF/tracing are secondary tools; whenever we're logging
> > information about a problem we should strive to log enough information
> > to debug the issue.
> 
> Ok, a simple question then:
> why can't you dump /proc/slabinfo after the OOM?

Taken to its logical conclusion, we arrive at:

	OOM-kill doesn't need to output anything at all except for
	what it killed because we can dump
	/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....

As it is, even asking such a question shows that you haven't looked
at the OOM kill output for a long time - it already reports the slab
cache usage information for caches that are reclaimable.

That is, if too much accounted slab cache based memory consumption
is detected at OOM-kill, it will call dump_unreclaimable_slab() to
dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
to the console as part of the OOM-kill output.

The problem Kent is trying to address is that this output *isn't
sufficient to debug shrinker based memory reclaim issues*. It hasn't
been for a long time, and so we've all got our own special debug
patches and methods for checking that shrinkers are doing what they
are supposed to. Kent is trying to formalise one of the more useful
general methods for exposing that internal information when OOM
occurs...

Indeed, I can think of several uses for a shrinker->to_text() output
that we simply cannot do right now.

Any shrinker that does garbage collection on something that is not a
pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
graphics memory allocators, binder, etc) has no visibility of the
actual memory being used by the subsystem in the OOM-kill output.
This information isn't in /proc/slabinfo, it's not accounted by a
SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
the core mm statistics.

e.g. How does anyone other than a XFS expert know that the 500k of
active xfs_buf handles in the slab cache actually pins 15GB of
cached metadata allocated directly from the page allocator, not just
the 150MB of slab cache the handles take up?

Another example is that an inode can pin lots of heap memory (e.g.
for in-memory extent lists) and that may not be freeable until the
inode is reclaimed. So while the slab cache might not be excessively
large, we might have a million inodes with a billion cumulative
extents cached in memory and it is the heap memory consumed by the
cached extents that is consuming the 30GB of "missing" kernel memory
that is causing OOM-kills to occur.

How is a user or developer supposed to know when one of these
situations has occurred given the current lack of memory usage
introspection into subsystems?

These are the sorts of situations that shrinker->to_text() would
allow us to enumerate when it is necessary (i.e. at OOM-kill). At
any other time, it just doesn't matter, but when we're at OOM having
a mechanism to report somewhat accurate subsystem memory consumption
would be very useful indeed.

> Unlike anon memory, slab memory (fs caches in particular) should not be heavily
> affected by killing some userspace task.

Whether tasks get killed or not is completely irrelevant. The issue
is that not all memory that is reclaimed by shrinkers is either pure
slab cache memory or directly accounted as reclaimable to the mm
subsystem....

-Dave.
Kent Overstreet Dec. 1, 2023, 1:47 a.m. UTC | #23
On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
> On Wed 29-11-23 18:11:47, Kent Overstreet wrote:
> > Considering that you're an MM guy, and that shrinkers are pretty much
> > universally used by _filesystem_ people - I'm not sure your experience
> > is the most relevant here?
> 
> I really do not understand where you have concluded that. In those years
> of analysis I was not debugging my _own_ code. I was dealing with
> customer reports and I would not really blame them to specifically
> trigger any class of OOM reports.

I've also spent a considerable amount of time debugging OOM issues, and
a lot of that took a lot longer than it should have due to insufficient
visibility into what the system was doing.

I'm talking about things like tuning journal reclaim/writeback behaviour
(this is a tricky one! shrinkers can't shrink if all items are dirty,
but random update workloads really suffer if we're biasing too much in
favour of memory reclaim, i.e. limiting dirty ratio too much), or
debugging tests in fstests that really like to exhaust memory on just
the inode cache.

If you can take the time to understand what other people are trying to
do and share your own perspective on what you find useful -  instead of
just saying "I've spent a lot of time on OOM reports and I haven't need
any of this/this is just for debugging" - we'll be able to have a much
more productive discussion.

Regarding another point you guys have been making - that this is "just
for developers debugging their own code" - that's a terribly dismissive
attitude to take as well.

Debugging doesn't stop when we're done testing the code on our local
machine and push it out to be merged; we're constantly debugging our
own code as it is running in the wild based on sparse bug reports with
at most a dmesg log. That dmesg log needs to, whenever possible, have
all the information we need to debug the issue.

In bcachefs, I have made this principle a _high_ priority; when I have a
bug in front of me, if there's visibility improvements that would make
the issue easier to debug I prioritize that _first_, and then fix the
actual bug. That's been one of the guiding principles that have enabled
me to work efficiently.

Code should tell you _what_ went wrong when something goes wrong,
whenever possible. Not just for ourselves, the individual developer, it
makes our code more maintainable by the people that come after us.

> > For one, the patchset adds tracking for when a shrinker was last asked
> > to free something, vs. when it was actually freed. So right there, we
> > can finally see at a glance when a shrinker has gotten stuck and which
> > one.
> 
> The primary problem I have with this is how to decide whether to dump
> shrinker data and/or which shrinkers to mention. How do you know that it
> is the specific shrinker which has contributed to the OOM state?
> Printing that data unconditionally will very likely be just additional
> ballast in most production situations. Sure if you are doing a filesystem
> development and you are tuning your specific shrinker then this might be
> a really important information to have. But then it is a debugging devel
> tool rather than something we want or need to have in a generic oom
> report.

Like I've mentioned before, this patchset only reports on the top 10
shrinkers, by number of objects. If we can plumb through reporting on
memory usage in _bytes_, that would help even more with deciding what to
report on.

> All that being said, I am with you on the fact that the oom report in
> its current form could see improvements.

I'm glad we're finally in agreement on something!

If you want to share your own ideas on what could be improved and what
you find useful, maybe we could find some more common ground.
Michal Hocko Dec. 1, 2023, 10:04 a.m. UTC | #24
On Thu 30-11-23 20:47:45, Kent Overstreet wrote:
> On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
[...]
> > All that being said, I am with you on the fact that the oom report in
> > its current form could see improvements.
> 
> I'm glad we're finally in agreement on something!
> 
> If you want to share your own ideas on what could be improved and what
> you find useful, maybe we could find some more common ground.

One thing that I would consider an improvement is to have a way to
subscribe drivers with excessive memory consumption or those which are
struggling to dump their state.

Maybe your proposal can be extended that way but the crucial point is to
not dump all sorts of random shrinkers' state and end up with unwieldy
reports.  If, on the other hand, any particular shrinker struggles to
reclaim memory and it is sitting on a lot of memory it could be able to
flag itself to be involved in the dump.
Roman Gushchin Dec. 1, 2023, 8:01 p.m. UTC | #25
On Fri, Dec 01, 2023 at 12:18:44PM +1100, Dave Chinner wrote:
> On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> > On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > > For non-bcachefs developers, who knows what those statistics mean?
> 
> > Ok, a simple question then:
> > why can't you dump /proc/slabinfo after the OOM?
> 
> Taken to its logical conclusion, we arrive at:
> 
> 	OOM-kill doesn't need to output anything at all except for
> 	what it killed because we can dump
> 	/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....
> 
> As it is, even asking such a question shows that you haven't looked
> at the OOM kill output for a long time - it already reports the slab
> cache usage information for caches that are reclaimable.
> 
> That is, if too much accounted slab cache based memory consumption
> > is detected at OOM-kill, it will call dump_unreclaimable_slab() to
> dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
> to the console as part of the OOM-kill output.

You are right, I missed that, partially because most of the OOMs I had to deal
with recently were memcg OOMs.

This changes my perspective on Kent's patches: if we dump this information
already, it might not be a bad idea to do it nicer. So I take my words back
here.

> 
> The problem Kent is trying to address is that this output *isn't
> sufficient to debug shrinker based memory reclaim issues*. It hasn't
> been for a long time, and so we've all got our own special debug
> patches and methods for checking that shrinkers are doing what they
> are supposed to. Kent is trying to formalise one of the more useful
> general methods for exposing that internal information when OOM
> occurs...
> 
> Indeed, I can think of several uses for a shrinker->to_text() output
> that we simply cannot do right now.
> 
> Any shrinker that does garbage collection on something that is not a
> pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
> graphics memory allocators, binder, etc) has no visibility of the
> > actual memory being used by the subsystem in the OOM-kill output.
> This information isn't in /proc/slabinfo, it's not accounted by a
> SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
> the core mm statistics.
> 
> e.g. How does anyone other than a XFS expert know that the 500k of
> active xfs_buf handles in the slab cache actually pins 15GB of
> cached metadata allocated directly from the page allocator, not just
> the 150MB of slab cache the handles take up?
> 
> Another example is that an inode can pin lots of heap memory (e.g.
> for in-memory extent lists) and that may not be freeable until the
> > inode is reclaimed. So while the slab cache might not be excessively
> > large, we might have a million inodes with a billion cumulative
> extents cached in memory and it is the heap memory consumed by the
> cached extents that is consuming the 30GB of "missing" kernel memory
> that is causing OOM-kills to occur.
> 
> How is a user or developer supposed to know when one of these
> situations has occurred given the current lack of memory usage
> introspection into subsystems?

What would be the proper solution to this problem from your point of view?
What functionality/API can mm provide to make the life of fs developers
better here?

> 
> These are the sorts of situations that shrinker->to_text() would
> allow us to enumerate when it is necessary (i.e. at OOM-kill). At
> any other time, it just doesn't matter, but when we're at OOM having
> a mechanism to report somewhat accurate subsystem memory consumption
> would be very useful indeed.
> 
> > Unlike anon memory, slab memory (fs caches in particular) should not be heavily
> > affected by killing some userspace task.
> 
> Whether tasks get killed or not is completely irrelevant. The issue
> is that not all memory that is reclaimed by shrinkers is either pure
> slab cache memory or directly accounted as reclaimable to the mm
> subsystem....

My problem with the current OOM reporting infrastructure (and it's a bit
offtopic here) - it's good for manually looking into these reports, but not
particularly great for automatic collection and analysis at scale.
So this is where I was coming from.

Thanks!
Kent Overstreet Dec. 1, 2023, 9:25 p.m. UTC | #26
On Fri, Dec 01, 2023 at 11:04:23AM +0100, Michal Hocko wrote:
> On Thu 30-11-23 20:47:45, Kent Overstreet wrote:
> > On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
> [...]
> > > All that being said, I am with you on the fact that the oom report in
> > > its current form could see improvements.
> > 
> > I'm glad we're finally in agreement on something!
> > 
> > If you want to share your own ideas on what could be improved and what
> > you find useful, maybe we could find some more common ground.
> 
> One thing that I would consider an improvement is to have a way to
> subscribe drivers with excessive memory consumption or those which are
> struggling to dump their state.

Remember the memory allocation profiling patchset? The one where you
kept complaining about "maintenancy overhead"?

We can plug that into the show_mem report too, and list the top 10
allocations by file and line number.

> Maybe your proposal can be extended that way but the crucial point is to
> not dump all sorts of random shrinkers' state and end up with unwieldy
> reports.  If, on the other hand, any particular shrinker struggles to
> reclaim memory and it is sitting on a lot of memory it could be able to
> flag itself to be involved in the dump.

Great, since as was mentioned in the original commit message it's not
"all sorts of random shrinkers", but top 10 by objects reported, what
I've got here should make you happy.
Kent Overstreet Dec. 1, 2023, 9:51 p.m. UTC | #27
On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> My problem with the current OOM reporting infrastructure (and it's a bit
> offtopic here) - it's good for manually looking into these reports, but not
> particularly great for automatic collection and analysis at scale.
> So this is where I was coming from.

Well, toml isn't the worst thing ever, as far as
machine-and-human-readable formats go.

For those unfamiliar it's related to json: it's less general than json,
but in exchange it's much easier on the eyes.
Michal Hocko Dec. 4, 2023, 10:33 a.m. UTC | #28
On Fri 01-12-23 16:25:06, Kent Overstreet wrote:
> On Fri, Dec 01, 2023 at 11:04:23AM +0100, Michal Hocko wrote:
> > On Thu 30-11-23 20:47:45, Kent Overstreet wrote:
> > > On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
> > [...]
> > > > All that being said, I am with you on the fact that the oom report in
> > > > its current form could see improvements.
> > > 
> > > I'm glad we're finally in agreement on something!
> > > 
> > > If you want to share your own ideas on what could be improved and what
> > > you find useful, maybe we could find some more common ground.
> > 
> > One thing that I would consider an improvement is to have a way to
> > subscribe drivers with excessive memory consumption or those which are
> > struggling to dump their state.
> 
> Remember the memory allocation profiling patchset? The one where you
> kept complaining about "maintenancy overhead"?

Yes, I still maintain my opinion on that approach. I have never
questioned the usefulness of the information.

> We can plug that into the show_mem report too, and list the top 10
> allocations by file and line number.
> 
> > Maybe your proposal can be extended that way but the crucial point is to
> > not dump all sorts of random shrinkers' state and end up with unwieldy
> > reports.  If, on the other hand, any particular shrinker struggles to
> > reclaim memory and it is sitting on a lot of memory it could be able to
> > flag itself to be involved in the dump.
> 
> Great, since as was mentioned in the original commit message it's not
> "all sorts of random shrinkers", but top 10 by objects reported, what
> I've got here should make you happy.

Can we do better and make that a shrinker decision rather than an
arbitrary top N selection? The thing is that shrinkers might not even
matter in many cases, so their output would just be ballast. The number
of objects is not a universally great choice. As Dave mentioned, metadata
might be pinning other objects.

That being said, if you want to give more debuggability power to
shrinkers, then it makes more sense to allow them to opt in to the oom
report rather than control which of them to involve from the oom
reporting code, which doesn't have enough context on its own.
Kent Overstreet Dec. 4, 2023, 6:15 p.m. UTC | #29
On Mon, Dec 04, 2023 at 11:33:12AM +0100, Michal Hocko wrote:
> On Fri 01-12-23 16:25:06, Kent Overstreet wrote:
> > On Fri, Dec 01, 2023 at 11:04:23AM +0100, Michal Hocko wrote:
> > > On Thu 30-11-23 20:47:45, Kent Overstreet wrote:
> > > > On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > All that being said, I am with you on the fact that the oom report in
> > > > > its current form could see improvements.
> > > > 
> > > > I'm glad we're finally in agreement on something!
> > > > 
> > > > If you want to share your own ideas on what could be improved and what
> > > > you find useful, maybe we could find some more common ground.
> > > 
> > > One thing that I would consider an improvement is to have a way to
> > > subscribe drivers with excessive memory consumption or those which are
> > > struggling to dump their state.
> > 
> > Remember the memory allocation profiling patchset? The one where you
> > kept complaining about "maintenancy overhead"?
> 
> Yes, I still maintain my opinion on that approach. I have never
> questioned usefulness of the information.
> 
> > We can plug that into the show_mem report too, and list the top 10
> > allocations by file and line number.
> > 
> > > Maybe your proposal can be extended that way but the crucial point is to
> > > not dump all sorts of random shrinkers' state and end up with unwieldy
> > > reports.  If, on the other hand, any particular shrinker struggles to
> > > reclaim memory and it is sitting on a lot of memory it could be able to
> > > flag itself to be involved in the dump.
> > 
> > Great, since as was mentioned in the original commit message it's not
> > "all sorts of random shrinkers", but top 10 by objects reported, what
> > I've got here should make you happy.
> 
> Can we do better and make that a shrinker decision rather than an
> arbitrary top N selection? The thing is that shrinkers might even not
> matter in many cases so their output would be just ballast. The number
> of objects is not universally great choice. As Dave mentioned metadata
> might be pinning other objects.
> 
> That being said, if you want to give more debugability power to
> shrinkers then it makes more sense to allow them to opt-in for the oom
> report rather than control which of them to involve from the oom
> reporting code which doesn't have enough context on its own.

If you've got an idea for a refinement, please submit your own patch and
I'll look at incorporating it into the series.
Michal Hocko Dec. 5, 2023, 8:49 a.m. UTC | #30
On Mon 04-12-23 13:15:53, Kent Overstreet wrote:
> On Mon, Dec 04, 2023 at 11:33:12AM +0100, Michal Hocko wrote:
> > On Fri 01-12-23 16:25:06, Kent Overstreet wrote:
> > > On Fri, Dec 01, 2023 at 11:04:23AM +0100, Michal Hocko wrote:
> > > > On Thu 30-11-23 20:47:45, Kent Overstreet wrote:
> > > > > On Thu, Nov 30, 2023 at 09:14:35AM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > All that being said, I am with you on the fact that the oom report in
> > > > > > its current form could see improvements.
> > > > > 
> > > > > I'm glad we're finally in agreement on something!
> > > > > 
> > > > > If you want to share your own ideas on what could be improved and what
> > > > > you find useful, maybe we could find some more common ground.
> > > > 
> > > > One thing that I would consider an improvement is to have a way to
> > > > subscribe drivers with excessive memory consumption or those which are
> > > > struggling to dump their state.
> > > 
> > > Remember the memory allocation profiling patchset? The one where you
> > > kept complaining about "maintenancy overhead"?
> > 
> > Yes, I still maintain my opinion on that approach. I have never
> > questioned usefulness of the information.
> > 
> > > We can plug that into the show_mem report too, and list the top 10
> > > allocations by file and line number.
> > > 
> > > > Maybe your proposal can be extended that way but the crucial point is to
> > > > not dump all sorts of random shrinkers' state and end up with unwieldy
> > > > reports.  If, on the other hand, any particular shrinker struggles to
> > > > reclaim memory and it is sitting on a lot of memory it could be able to
> > > > flag itself to be involved in the dump.
> > > 
> > > Great, since as was mentioned in the original commit message it's not
> > > "all sorts of random shrinkers", but top 10 by objects reported, what
> > > I've got here should make you happy.
> > 
> > Can we do better and make that a shrinker decision rather than an
> > arbitrary top N selection? The thing is that shrinkers might even not
> > matter in many cases so their output would be just ballast. The number
> > of objects is not universally great choice. As Dave mentioned metadata
> > might be pinning other objects.
> > 
> > That being said, if you want to give more debugability power to
> > shrinkers then it makes more sense to allow them to opt-in for the oom
> > report rather than control which of them to involve from the oom
> > reporting code which doesn't have enough context on its own.
> 
> If you've got an idea for a refinement, please submit your own patch and
> I'll look at incorporating it into the series.

OK, noted. Let me just remind you that /me as a reviewer have pointed
out several shortcomings of your proposed solution. From my POV this is
not something we should merge in its current form.

I am not going to comment further in this email thread as it seems you
are not interested in the review feedback and I have better things to do
than talking to /dev/null. This is not the first time you act like that
and I really do recommend you to think about your attitude and
communication style.

Good luck!
Kent Overstreet Dec. 5, 2023, 11:21 p.m. UTC | #31
On Tue, Dec 05, 2023 at 09:49:15AM +0100, Michal Hocko wrote:
> OK, noted. Let me just remind you that /me as a reviewer have pointed
> out several shortcomings of your proposed solution. From my POV this is
> not something we should merge in its current form.
> 
> I am not going to comment further in this email thread as it seems you
> are not interested in the review feedback and I have better things to do
> than talking to /dev/null. This is not the first time you act like that
> and I really do recommend you to think about your attitude and
> communication style.

Michal, I was extending an offer to you to contribute; as programmers,
the primary way we share our ideas is through code.

This kind of lecturing is pretty hard for me to respond to, so let's
just try to stay out of each other's way from here on out.

Thanks,
Kent
Dave Chinner Dec. 6, 2023, 8:16 a.m. UTC | #32
On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> On Fri, Dec 01, 2023 at 12:18:44PM +1100, Dave Chinner wrote:
> > On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> > > On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > > > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > > > For non-bcachefs developers, who knows what those statistics mean?
> > 
> > > Ok, a simple question then:
> > > why can't you dump /proc/slabinfo after the OOM?
> > 
> > Taken to its logical conclusion, we arrive at:
> > 
> > 	OOM-kill doesn't need to output anything at all except for
> > 	what it killed because we can dump
> > 	/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....
> > 
> > As it is, even asking such a question shows that you haven't looked
> > at the OOM kill output for a long time - it already reports the slab
> > cache usage information for caches that are reclaimable.
> > 
> > That is, if too much accounted slab cache based memory consumption
> > is detected at OOM-kill, it will call dump_unreclaimable_slab() to
> > dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
> > to the console as part of the OOM-kill output.
> 
> You are right, I missed that, partially because most of OOM's I had to deal
> with recently were memcg OOM's.
> 
> This changes my perspective at Kent's patches, if we dump this information
> already, it might be not a bad idea to do it nicer. So I take my words back
> here.
> 
> > 
> > The problem Kent is trying to address is that this output *isn't
> > sufficient to debug shrinker based memory reclaim issues*. It hasn't
> > been for a long time, and so we've all got our own special debug
> > patches and methods for checking that shrinkers are doing what they
> > are supposed to. Kent is trying to formalise one of the more useful
> > general methods for exposing that internal information when OOM
> > occurs...
> > 
> > Indeed, I can think of several uses for a shrinker->to_text() output
> > that we simply cannot do right now.
> > 
> > Any shrinker that does garbage collection on something that is not a
> > pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
> > graphics memory allocators, binder, etc) has no visibility of the
> > actual memory being used by the subsystem in the OOM-kill output.
> > This information isn't in /proc/slabinfo, it's not accounted by a
> > SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
> > the core mm statistics.
> > 
> > e.g. How does anyone other than a XFS expert know that the 500k of
> > active xfs_buf handles in the slab cache actually pins 15GB of
> > cached metadata allocated directly from the page allocator, not just
> > the 150MB of slab cache the handles take up?
> > 
> > Another example is that an inode can pin lots of heap memory (e.g.
> > for in-memory extent lists) and that may not be freeable until the
> > inode is reclaimed. So while the slab cache might not be excesively
> > large, we might have an a million inodes with a billion cumulative
> > extents cached in memory and it is the heap memory consumed by the
> > cached extents that is consuming the 30GB of "missing" kernel memory
> > that is causing OOM-kills to occur.
> > 
> > How is a user or developer supposed to know when one of these
> > situations has occurred given the current lack of memory usage
> > introspection into subsystems?
> 
> What would be the proper solution to this problem from your point of view?
> What functionality/API mm can provide to make the life of fs developers
> better here?

What can we do better?

The first thing we can do better that comes to mind is to merge
Kent's patches that allow the shrinker owner to output debug
information when requested by the infrastructure.

Then we - the shrinker implementers - have some control of our own
destiny.  We can add whatever we need to solve shrinker and OOM
problems related to our shrinkers not doing the right thing.

But without that callout from the infrastructure and the
infrastructure to drive it at appropriate times, we will make zero
progress improving the situation. 

Yes, the code may not be perfect and, yes, it may not be useful to
mm developers, but for the people who have to debug shrinker related
problems in production systems we need all the help we can get. We
certainly don't care if it isn't perfect, just having something we
can partially tailor to our individual needs is far, far better
than the current situation of nothing at all...

-Dave.
Kent Overstreet Dec. 6, 2023, 7:13 p.m. UTC | #33
On Wed, Dec 06, 2023 at 07:16:04PM +1100, Dave Chinner wrote:
> On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> > What would be the proper solution to this problem from your point of view?
> > What functionality/API mm can provide to make the life of fs developers
> > better here?
> 
> What can we do better?
> 
> The first thing we can do better that comes to mind is to merge
> Kent's patches that allow the shrinker owner to output debug
> information when requested by the infrastructure.
> 
> Then we - the shrinker implementers - have some control of our own
> destiny.  We can add whatever we need to solve shrinker and OOM
> problems related to our shrinkers not doing the right thing.
> 
> But without that callout from the infrastructure and the
> infrastructure to drive it at appropriate times, we will make zero
> progress improving the situation. 
> 
> Yes, the code may not be perfect and, yes, it may not be useful to
> mm developers, but for the people who have to debug shrinker related
> problems in production systems we need all the help we can get. We
> certainly don't care if it isn't perfect, just having something we
> can partially tailor to our individual needs is far, far better
> than the current situation of nothing at all...

Of course if mm people don't want it I've got better things to do than
fight uphill battles just to get some reviewed-by tags. Real high
quality review feedback in this thread.
Roman Gushchin Dec. 9, 2023, 1:44 a.m. UTC | #34
On Wed, Dec 06, 2023 at 02:13:49PM -0500, Kent Overstreet wrote:
> On Wed, Dec 06, 2023 at 07:16:04PM +1100, Dave Chinner wrote:
> > On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> > > What would be the proper solution to this problem from your point of view?
> > > What functionality/API mm can provide to make the life of fs developers
> > > better here?
> > 
> > What can we do better?
> > 
> > The first thing we can do better that comes to mind is to merge
> > Kent's patches that allow the shrinker owner to output debug
> > information when requested by the infrastructure.
> > 
> > Then we - the shrinker implementers - have some control of our own
> > destiny.  We can add whatever we need to solve shrinker and OOM
> > problems related to our shrinkers not doing the right thing.
> > 
> > But without that callout from the infrastructure and the
> > infrastructure to drive it at appropriate times, we will make zero
> > progress improving the situation. 
> > 
> > Yes, the code may not be perfect and, yes, it may not be useful to
> > mm developers, but for the people who have to debug shrinker related
> > problems in production systems we need all the help we can get. We
> > certainly don't care if it isn't perfect, just having something we
> > can partially tailor to our individual needs is far, far better
> > than the current situation of nothing at all...
> 
> Of course if mm people don't want it I've got better things to do than
> fight uphill battles just to get some reviewed-by tags. Real high
> quality review feedback in this thread.

(ignoring an attack towards all mm people, sigh)

Kent, I think extending the shrinker debugfs interface is useful.
And in this context there is no need to limit the output to 10 items.
Also most of the disagreements will vanish (sorry if I missed something,
but it looks like all concerns are related to the oom part).
Will it work for you?

If yes, would you be able to drop the oom part (at least for now)
and resend the patchset?

Thanks!
Kent Overstreet Dec. 9, 2023, 2:04 a.m. UTC | #35
On Fri, Dec 08, 2023 at 05:44:16PM -0800, Roman Gushchin wrote:
> On Wed, Dec 06, 2023 at 02:13:49PM -0500, Kent Overstreet wrote:
> > On Wed, Dec 06, 2023 at 07:16:04PM +1100, Dave Chinner wrote:
> > > On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> > > > What would be the proper solution to this problem from your point of view?
> > > > What functionality/API mm can provide to make the life of fs developers
> > > > better here?
> > > 
> > > What can we do better?
> > > 
> > > The first thing we can do better that comes to mind is to merge
> > > Kent's patches that allow the shrinker owner to output debug
> > > information when requested by the infrastructure.
> > > 
> > > Then we - the shrinker implementers - have some control of our own
> > > destiny.  We can add whatever we need to solve shrinker and OOM
> > > problems related to our shrinkers not doing the right thing.
> > > 
> > > But without that callout from the infrastructure and the
> > > infrastructure to drive it at appropriate times, we will make zero
> > > progress improving the situation. 
> > > 
> > > Yes, the code may not be perfect and, yes, it may not be useful to
> > > mm developers, but for the people who have to debug shrinker related
> > > problems in production systems we need all the help we can get. We
> > > certainly don't care if it isn't perfect, just having something we
> > > can partially tailor to our individual needs is far, far better
> > > than the current situation of nothing at all...
> > 
> > Of course if mm people don't want it I've got better things to do than
> > fight uphill battles just to get some reviewed-by tags. Real high
> > quality review feedback in this thread.
> 
> (ignoring an attack towards all mm people, sigh)
> 
> Kent, I think extending the shrinker debugfs interface is useful.
> And in this context there is no need to limit the output to 10 items.
> Also most of disagreements will vanish (sorry, if I missed something,
> but looks like all concerns are related to the oom part).
> Will it work for you?
> 
> If yes, would you be able to drop the oom part (at least for now)
> and resend the patchset?

No: OOM debugging is the entire point of the patchset.
diff mbox series

Patch

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..968c55474e78 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -24,6 +24,8 @@  struct shrinker_info {
 	struct shrinker_info_unit *unit[];
 };
 
+struct seq_buf;
+
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
  * We consolidate the values for easier extension later.
@@ -80,10 +82,12 @@  struct shrink_control {
  * @flags determine the shrinker abilities, like numa awareness
  */
 struct shrinker {
+	const char *name;
 	unsigned long (*count_objects)(struct shrinker *,
 				       struct shrink_control *sc);
 	unsigned long (*scan_objects)(struct shrinker *,
 				      struct shrink_control *sc);
+	void (*to_text)(struct seq_buf *, struct shrinker *);
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
@@ -110,7 +114,6 @@  struct shrinker {
 #endif
 #ifdef CONFIG_SHRINKER_DEBUG
 	int debugfs_id;
-	const char *name;
 	struct dentry *debugfs_entry;
 #endif
 	/* objs pending delete, per node */
@@ -135,6 +138,7 @@  __printf(2, 3)
 struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
 void shrinker_register(struct shrinker *shrinker);
 void shrinker_free(struct shrinker *shrinker);
+void shrinkers_to_text(struct seq_buf *);
 
 static inline bool shrinker_try_get(struct shrinker *shrinker)
 {
diff --git a/mm/shrinker.c b/mm/shrinker.c
index dd91eab43ed3..4976dbac4c83 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -1,8 +1,9 @@ 
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/memcontrol.h>
+#include <linux/rculist.h>
 #include <linux/rwsem.h>
+#include <linux/seq_buf.h>
 #include <linux/shrinker.h>
-#include <linux/rculist.h>
 #include <trace/events/vmscan.h>
 
 #include "internal.h"
@@ -807,3 +808,73 @@  void shrinker_free(struct shrinker *shrinker)
 	call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
 }
 EXPORT_SYMBOL_GPL(shrinker_free);
+
+void shrinker_to_text(struct seq_buf *out, struct shrinker *shrinker)
+{
+	struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
+
+	seq_buf_puts(out, shrinker->name);
+	seq_buf_printf(out, " objects: %lu\n", shrinker->count_objects(shrinker, &sc));
+
+	if (shrinker->to_text) {
+		shrinker->to_text(out, shrinker);
+		seq_buf_puts(out, "\n");
+	}
+}
+
+/**
+ * shrinkers_to_text - Report on shrinkers with highest usage
+ *
+ * This reports on the top 10 shrinkers, by object counts, in sorted order:
+ * intended to be used for OOM reporting.
+ */
+void shrinkers_to_text(struct seq_buf *out)
+{
+	struct shrinker *shrinker;
+	struct shrinker_by_mem {
+		struct shrinker	*shrinker;
+		unsigned long	mem;
+	} shrinkers_by_mem[10];
+	int i, nr = 0;
+
+	if (!mutex_trylock(&shrinker_mutex)) {
+		seq_buf_puts(out, "(couldn't take shrinker lock)");
+		return;
+	}
+
+	list_for_each_entry(shrinker, &shrinker_list, list) {
+		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
+		unsigned long mem = shrinker->count_objects(shrinker, &sc);
+
+		if (!mem || mem == SHRINK_STOP || mem == SHRINK_EMPTY)
+			continue;
+
+		for (i = 0; i < nr; i++)
+			if (mem < shrinkers_by_mem[i].mem)
+				break;
+
+		if (nr < ARRAY_SIZE(shrinkers_by_mem)) {
+			memmove(&shrinkers_by_mem[i + 1],
+				&shrinkers_by_mem[i],
+				sizeof(shrinkers_by_mem[0]) * (nr - i));
+			nr++;
+		} else if (i) {
+			i--;
+			memmove(&shrinkers_by_mem[0],
+				&shrinkers_by_mem[1],
+				sizeof(shrinkers_by_mem[0]) * i);
+		} else {
+			continue;
+		}
+
+		shrinkers_by_mem[i] = (struct shrinker_by_mem) {
+			.shrinker = shrinker,
+			.mem = mem,
+		};
+	}
+
+	for (i = nr - 1; i >= 0; --i)
+		shrinker_to_text(out, shrinkers_by_mem[i].shrinker);
+
+	mutex_unlock(&shrinker_mutex);
+}