[1/3] mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES

Message ID	20180305133743.12746-2-guro@fb.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Roman Gushchin <guro@fb.com> To: <linux-mm@kvack.org> CC: Roman Gushchin <guro@fb.com>, Andrew Morton <akpm@linux-foundation.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Michal Hocko <mhocko@suse.com>, Johannes Weiner <hannes@cmpxchg.org>, <linux-fsdevel@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <kernel-team@fb.com> Subject: [PATCH 1/3] mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES Date: Mon, 5 Mar 2018 13:37:40 +0000 Message-ID: <20180305133743.12746-2-guro@fb.com> In-Reply-To: <20180305133743.12746-1-guro@fb.com> References: <20180305133743.12746-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Received-SPF: None (protection.outlook.com: fb.com does not designate permitted sender hosts) X-Microsoft-Exchange-Diagnostics: =?us-ascii?Q?1; BL2PR15MB1076; 23:hM8CsiWjJ/S+rXPcXoTkDw5b8cHgGvv8LDTd7RlPq?= =?us-ascii?Q?l0u+TIcwXc09Hw4t1B31GeCxUU2AmhPVloYRR4tGLSVcE8lJarfSadjH/HA5?= =?us-ascii?Q?EdUtI41OqxU1gQr1+s4Ry1GTH3bEr5r28oQaDoe+pEfv/7bnPYqYFhGqBeM/?= =?us-ascii?Q?IZw4j7uVN+FEPfP/oSlgBpxtVWJA9V2zoM8g90TI+Shv4rYfw6esLqkrqbc9?= =?us-ascii?Q?gav/+HbxGXfV6ujPwNbl6OoHHgr2vQh2uPVe5H+m5oTq2S5HoNVtGKX8doyO?= =?us-ascii?Q?JzUUfFtysRFjVrGMX3NfT9+MQLNNl5AF6BTT5CWAHYiSVxM1Qlmqwvj6/pvi?= =?us-ascii?Q?8e1frWC/NtJfjnocWmZoh0P711xOywETx6lM99OHeDxhEUSxffTsZ9sg62rC?= =?us-ascii?Q?UcotZ0iWW0KyWeSfxOK+NZK8V40MbUXT+Rl+Cz6+ApaBBbxF7XNeDkO9sJtI?= =?us-ascii?Q?DJ7AwdrxamZuMVIOfHMBZgSsGr6yEqRVQ8QVWlDMKsDOc0hnSidesRy5S/mA?= =?us-ascii?Q?HUXAfOojwogHYEkOpN/K72/OI8sykzMFjGTmKOGKaUa5xxq+uWYd/alX3utG?= =?us-ascii?Q?xdgAIco6uj0s3Gc8LD7BbpowcwHtDbmKs6XN7nu4IeglcR4d0I+XlRwYCOnx?= =?us-ascii?Q?UCIeLNgf+FI+y/EF4HQ2bXEjOvDUMYv3jkN2t8fomIGukDO8rn56xM145ifB?= =?us-ascii?Q?lYjU6outwv/QacOLzoB6GLZ1n+Ny7n/R1CLhctgXxpF2FIh0nRIfuil9ul5J?= =?us-ascii?Q?s5lClvL9M8FFbGUiIhVRW0InlHdxoSuf8x4Y6NKFCh32lTbUa0MCnFSAqOkd?= =?us-ascii?Q?mrqsuPF65uLrpSElRmnrdVuQ6edvsJoewkUXbhqQduAYBDZa6lDvf34m4tMV?= =?us-ascii?Q?cdSIU7CUCmQSHq+xqWMsYM1UQNWRZ6fxFfBq7LUzMABdjBglg+IbkUn7Z12k?= =?us-ascii?Q?Dg5fOMWTusR+bmH4S6mHRuwpXQm26UatwhVd2SbUvGFS/842SF6dpJlDuyvH?= =?us-ascii?Q?cTZfla6RbT7M7HJF0McJTU4aWviOUw8rV3+fWAF4HYsUVxhIoC4pTyy46Poz?= =?us-ascii?Q?koid3L8x/IWYWwjnLgFn95OeURmcHG5zS0Lbw0gU75vlCqKhN+rBqrlNBCei?= =?us-ascii?Q?Yho9VWMblDC6e50CAONw/Ahb6L0RlrWEN/ame1IlXQqDDpFyfYy1/PYM/u6+?= =?us-ascii?Q?mCD3fPhJG2DnJt/1KrMjAZwXoiF234Yq1TMZJqSZ/ncoByzmDcF0QhPpA=3D?= =?us-ascii?Q?=3D?= X-Microsoft-Antispam-Message-Info: x0oi6CWW4o6FCWhYXDKOEtnZtEE8hol7VUykFhIWDqHTv77TsRvAUqiSnhrZpRXKd7MybsDzc+zbDIRGntWwhemAC3guJS6OzFV2dTeH3iKWzpbNaJghYwn61An+hoU7+OV4KdU1zG9yU6Aij96gwrwQwp6Mwxk9nWpw1ZKd3JTBVfo6TE+goo428RZ7qM+N X-Microsoft-Exchange-Diagnostics: 1; BL2PR15MB1076; 6:cTMHpDX7Bu7yIhXlkrkEymqulctJ46kdZaVEW/Rs2MJl7+Je+9J3zbS8592AhkaKaLuLn/sJdVqL6+wmhPh0iLKZp1qb+xfTcQg8s4oWjWOTiQ/sjEkEWS0Ri+IV3hmfDj5cLHxiKTN8EoeZ01cJFa+CEIHrZE11Wz1tImKY1YmLrhKWv4hCfFhlnqZ90oT4nWU3Mmpy7mVBgbIGybVt3DJHwHLyzcHcFoogMpu85/JCSSNs8Mol330OHa9x8bywpjFJWpGWcJ4ZZqDfN3D9WhLyG97ZlHJH1+A5ZYsw+0CWgKs1vTDeNevbmnaOZzsapuyKHBWDPq/iOAvpOrf3F/9ksQchp9CO0VVMq6YUfq0=; 5:Q08TXyoMVCEqO3lC87YfYAnmbVVkU2UyHEoSdEqCFM3ZF4bwZXXd+0229JjA8qC2lHs+BFlOtaVJ1pEfSQmnBWTFEI8XocFev039qH0tQ3NSs633Xv2+nwbGOFCMrFmh8NI2FW7B7n7MDy3r3TemyQTF/ZziuEIwKi3XpfSK2DE=; 24:d6Kf2Wm2D5X6kdUk3Pbb8+cmZoassR7QsPfldqZWQ6KgRIbdRYYf+2tmJ1dIErR5zq8zhuryfTwrbx+Ge0PEbOCr9A6agLoYzVH186OplmM=; 7:MWb/4HDqh5KfYjZjJuvqx+QCZ9e/Qa3dTeNAYpR2fR26752tunbWXySn+gkgLtH0CngpWUFNZwhDtlE+Dcxm17vsE56eo9KaBq9WtV63Ku5f3ZDObRrxqsvI3xYFSjLJ4RqqfGcnSEBoXcT98MZcialK5OhYWP49kdRMCOKlOvy60Lx8ychAMdsuv39Akf4YpmVMKkC21i/ue4pBjdQNyvcF5Jb0YZAxhY/rhFDhhgDqOnBoKIv1FQDoCWOCkC/b SpamDiagnosticOutput: 1:99 SpamDiagnosticMetadata: NSPM X-Microsoft-Exchange-Diagnostics: 1; BL2PR15MB1076; 20:3fEO9p+t1n70M+59vg3cvBswB6XIgg9q1iXk/okJsAmGWDp8rcUwd9AnDF05YccMIw1zVFuRXrZGBQfNlzkk4UaBa96bxLLKWZCQ16ONjio91k5Gq8vefrAIx/e8LqmOYhKAYRyVwzJ1NZ9n1TfBZ1FOw9dKr2xDzKJeEegulTY= X-MS-Exchange-CrossTenant-OriginalArrivalTime: 05 Mar 2018 13:38:06.6135 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 3af82e2d-badf-415f-14de-08d5829e54be X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Roman Gushchin March 5, 2018, 1:37 p.m. UTC

This patch introduces a concept of indirectly reclaimable memory
and adds the corresponding memory counter and /proc/vmstat item.

Indirectly reclaimable memory is any sort of memory, used by
the kernel (except of reclaimable slabs), which is actually
reclaimable, i.e. will be released under memory pressure.

The counter is in bytes, as it's not always possible to
count such objects in pages. The name contains BYTES
by analogy to NR_KERNEL_STACK_KB.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: kernel-team@fb.com
---
 include/linux/mmzone.h | 1 +
 mm/vmstat.c            | 1 +
 2 files changed, 2 insertions(+)

Vlastimil Babka April 11, 2018, 1:16 p.m. UTC | #1

[+CC linux-api]

On 03/05/2018 02:37 PM, Roman Gushchin wrote:
> This patch introduces a concept of indirectly reclaimable memory
> and adds the corresponding memory counter and /proc/vmstat item.
> 
> Indirectly reclaimable memory is any sort of memory, used by
> the kernel (except of reclaimable slabs), which is actually
> reclaimable, i.e. will be released under memory pressure.
> 
> The counter is in bytes, as it's not always possible to
> count such objects in pages. The name contains BYTES
> by analogy to NR_KERNEL_STACK_KB.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: kernel-team@fb.com

Hmm, looks like I'm late and this user-visible API change was just
merged. But it's for rc1, so we can still change it, hopefully?

One problem I see with the counter is that it's in bytes, but among
counters that use pages, and the name doesn't indicate it. Then, I don't
see why users should care about the "indirectly" part, as that's just an
implementation detail. It is reclaimable and that's what matters, right?
(I also wanted to complain about lack of Documentation/... update, but
looks like there's no general file about vmstat, ugh)

I also kind of liked the idea from v1 rfc posting that there would be a
separate set of reclaimable kmalloc-X caches for these kind of
allocations. Besides accounting, it should also help reduce memory
fragmentation. The right variant of cache would be detected via
__GFP_RECLAIMABLE.

With that in mind, can we at least for now put the (manually maintained)
byte counter in a variable that's not directly exposed via /proc/vmstat,
and then when printing nr_slab_reclaimable, simply add the value
(divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
subtract the same value. This way we would be simply making the existing
counters more precise, in line with their semantics.

Thoughts?
Vlastimil

> ---
>  include/linux/mmzone.h | 1 +
>  mm/vmstat.c            | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e09fe563d5dc..15e783f29e21 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -180,6 +180,7 @@ enum node_stat_item {
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
> +	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 40b2db6db6b1..b6b5684f31fe 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1161,6 +1161,7 @@ const char * const vmstat_text[] = {
>  	"nr_vmscan_immediate_reclaim",
>  	"nr_dirtied",
>  	"nr_written",
> +	"nr_indirectly_reclaimable",
>  
>  	/* enum writeback_stat_item counters */
>  	"nr_dirty_threshold",
>

Roman Gushchin April 11, 2018, 1:56 p.m. UTC | #2

On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
> [+CC linux-api]
> 
> On 03/05/2018 02:37 PM, Roman Gushchin wrote:
> > This patch introduces a concept of indirectly reclaimable memory
> > and adds the corresponding memory counter and /proc/vmstat item.
> > 
> > Indirectly reclaimable memory is any sort of memory, used by
> > the kernel (except of reclaimable slabs), which is actually
> > reclaimable, i.e. will be released under memory pressure.
> > 
> > The counter is in bytes, as it's not always possible to
> > count such objects in pages. The name contains BYTES
> > by analogy to NR_KERNEL_STACK_KB.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: kernel-team@fb.com
> 
> Hmm, looks like I'm late and this user-visible API change was just
> merged. But it's for rc1, so we can still change it, hopefully?
> 
> One problem I see with the counter is that it's in bytes, but among
> counters that use pages, and the name doesn't indicate it.

Here I just followed "nr_kernel_stack" path, which is measured in kB,
but this is not mentioned in the field name.

> Then, I don't
> see why users should care about the "indirectly" part, as that's just an
> implementation detail. It is reclaimable and that's what matters, right?
> (I also wanted to complain about lack of Documentation/... update, but
> looks like there's no general file about vmstat, ugh)

I agree, that it's a bit weird, and it's probably better to not expose
it at all; but this is how all vm counters work. We do expose them all
in /proc/vmstat. A good number of them is useless until you are not a
mm developer, so it's arguable more "debug info" rather than "api".
It's definitely not a reason to make them messy.
Does "nr_indirectly_reclaimable_bytes" look better to you?

> 
> I also kind of liked the idea from v1 rfc posting that there would be a
> separate set of reclaimable kmalloc-X caches for these kind of
> allocations. Besides accounting, it should also help reduce memory
> fragmentation. The right variant of cache would be detected via
> __GFP_RECLAIMABLE.

Well, the downside is that we have to introduce X new caches
just for this particular problem. I'm not strictly against the idea,
but not convinced that it's much better.

> 
> With that in mind, can we at least for now put the (manually maintained)
> byte counter in a variable that's not directly exposed via /proc/vmstat,
> and then when printing nr_slab_reclaimable, simply add the value
> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
> subtract the same value. This way we would be simply making the existing
> counters more precise, in line with their semantics.

Idk, I don't like the idea of adding a counter outside of the vm counters
infrastructure, and I definitely wouldn't touch the exposed
nr_slab_reclaimable and nr_slab_unreclaimable fields.
We do have some stats in /proc/slabinfo, /proc/meminfo and /sys/kernel/slab
and I think that we should keep it consistent.

Thanks!

> 
> Thoughts?
> Vlastimil
> 
> > ---
> >  include/linux/mmzone.h | 1 +
> >  mm/vmstat.c            | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index e09fe563d5dc..15e783f29e21 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -180,6 +180,7 @@ enum node_stat_item {
> >  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
> >  	NR_DIRTIED,		/* page dirtyings since bootup */
> >  	NR_WRITTEN,		/* page writings since bootup */
> > +	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
> >  	NR_VM_NODE_STAT_ITEMS
> >  };
> >  
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 40b2db6db6b1..b6b5684f31fe 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1161,6 +1161,7 @@ const char * const vmstat_text[] = {
> >  	"nr_vmscan_immediate_reclaim",
> >  	"nr_dirtied",
> >  	"nr_written",
> > +	"nr_indirectly_reclaimable",
> >  
> >  	/* enum writeback_stat_item counters */
> >  	"nr_dirty_threshold",
> > 
>

Vlastimil Babka April 12, 2018, 6:52 a.m. UTC | #3

On 04/11/2018 03:56 PM, Roman Gushchin wrote:
> On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
>> [+CC linux-api]
>>
>> On 03/05/2018 02:37 PM, Roman Gushchin wrote:
>>> This patch introduces a concept of indirectly reclaimable memory
>>> and adds the corresponding memory counter and /proc/vmstat item.
>>>
>>> Indirectly reclaimable memory is any sort of memory, used by
>>> the kernel (except of reclaimable slabs), which is actually
>>> reclaimable, i.e. will be released under memory pressure.
>>>
>>> The counter is in bytes, as it's not always possible to
>>> count such objects in pages. The name contains BYTES
>>> by analogy to NR_KERNEL_STACK_KB.
>>>
>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: linux-fsdevel@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: linux-mm@kvack.org
>>> Cc: kernel-team@fb.com
>>
>> Hmm, looks like I'm late and this user-visible API change was just
>> merged. But it's for rc1, so we can still change it, hopefully?
>>
>> One problem I see with the counter is that it's in bytes, but among
>> counters that use pages, and the name doesn't indicate it.
> 
> Here I just followed "nr_kernel_stack" path, which is measured in kB,
> but this is not mentioned in the field name.

Oh, didn't know. Bad example to follow :P

>> Then, I don't
>> see why users should care about the "indirectly" part, as that's just an
>> implementation detail. It is reclaimable and that's what matters, right?
>> (I also wanted to complain about lack of Documentation/... update, but
>> looks like there's no general file about vmstat, ugh)
> 
> I agree, that it's a bit weird, and it's probably better to not expose
> it at all; but this is how all vm counters work. We do expose them all
> in /proc/vmstat. A good number of them is useless until you are not a
> mm developer, so it's arguable more "debug info" rather than "api".

Yeah the problem is that once tools start rely on them, they fall under
the "do not break userspace" rule, however we call them. So being
cautious and conservative can't hurt.

> It's definitely not a reason to make them messy.
> Does "nr_indirectly_reclaimable_bytes" look better to you?

It still has has the "indirecly" part and feels arbitrary :/

>>
>> I also kind of liked the idea from v1 rfc posting that there would be a
>> separate set of reclaimable kmalloc-X caches for these kind of
>> allocations. Besides accounting, it should also help reduce memory
>> fragmentation. The right variant of cache would be detected via
>> __GFP_RECLAIMABLE.
> 
> Well, the downside is that we have to introduce X new caches
> just for this particular problem. I'm not strictly against the idea,
> but not convinced that it's much better.

Maybe we can find more cases that would benefit from it. Heck, even slab
itself allocates some management structures from the generic kmalloc
caches, and if they are used for reclaimable caches, they could be
tracked as reclaimable as well.

>>
>> With that in mind, can we at least for now put the (manually maintained)
>> byte counter in a variable that's not directly exposed via /proc/vmstat,
>> and then when printing nr_slab_reclaimable, simply add the value
>> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
>> subtract the same value. This way we would be simply making the existing
>> counters more precise, in line with their semantics.
> 
> Idk, I don't like the idea of adding a counter outside of the vm counters
> infrastructure, and I definitely wouldn't touch the exposed
> nr_slab_reclaimable and nr_slab_unreclaimable fields.

We would be just making the reported values more precise wrt reality.

> We do have some stats in /proc/slabinfo, /proc/meminfo and /sys/kernel/slab
> and I think that we should keep it consistent.

Right, meminfo would be adjusted the same. slabinfo doesn't indicate
which caches are reclaimable, so there will be no change.
/sys/kernel/slab/cache/reclaim_account does, but I doubt anything will
break.

> Thanks!
> 
>>
>> Thoughts?
>> Vlastimil
>>
>>> ---
>>>  include/linux/mmzone.h | 1 +
>>>  mm/vmstat.c            | 1 +
>>>  2 files changed, 2 insertions(+)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index e09fe563d5dc..15e783f29e21 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -180,6 +180,7 @@ enum node_stat_item {
>>>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>>>  	NR_DIRTIED,		/* page dirtyings since bootup */
>>>  	NR_WRITTEN,		/* page writings since bootup */
>>> +	NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */
>>>  	NR_VM_NODE_STAT_ITEMS
>>>  };
>>>  
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 40b2db6db6b1..b6b5684f31fe 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -1161,6 +1161,7 @@ const char * const vmstat_text[] = {
>>>  	"nr_vmscan_immediate_reclaim",
>>>  	"nr_dirtied",
>>>  	"nr_written",
>>> +	"nr_indirectly_reclaimable",
>>>  
>>>  	/* enum writeback_stat_item counters */
>>>  	"nr_dirty_threshold",
>>>
>>
>

Michal Hocko April 12, 2018, 11:52 a.m. UTC | #4

On Thu 12-04-18 08:52:52, Vlastimil Babka wrote:
> On 04/11/2018 03:56 PM, Roman Gushchin wrote:
> > On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
[...]
> >> With that in mind, can we at least for now put the (manually maintained)
> >> byte counter in a variable that's not directly exposed via /proc/vmstat,
> >> and then when printing nr_slab_reclaimable, simply add the value
> >> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
> >> subtract the same value. This way we would be simply making the existing
> >> counters more precise, in line with their semantics.
> > 
> > Idk, I don't like the idea of adding a counter outside of the vm counters
> > infrastructure, and I definitely wouldn't touch the exposed
> > nr_slab_reclaimable and nr_slab_unreclaimable fields.

Why?

> We would be just making the reported values more precise wrt reality.

I was suggesting something similar in an earlier discussion. I am not
really happy about the new exposed counter either. It is just arbitrary
by name yet very specific for this particular usecase.

What is a poor user supposed to do with the new counter? Can this be
used for any calculations?

Roman Gushchin April 12, 2018, 2:38 p.m. UTC | #5

On Thu, Apr 12, 2018 at 01:52:17PM +0200, Michal Hocko wrote:
> On Thu 12-04-18 08:52:52, Vlastimil Babka wrote:
> > On 04/11/2018 03:56 PM, Roman Gushchin wrote:
> > > On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
> [...]
> > >> With that in mind, can we at least for now put the (manually maintained)
> > >> byte counter in a variable that's not directly exposed via /proc/vmstat,
> > >> and then when printing nr_slab_reclaimable, simply add the value
> > >> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
> > >> subtract the same value. This way we would be simply making the existing
> > >> counters more precise, in line with their semantics.
> > > 
> > > Idk, I don't like the idea of adding a counter outside of the vm counters
> > > infrastructure, and I definitely wouldn't touch the exposed
> > > nr_slab_reclaimable and nr_slab_unreclaimable fields.
> 
> Why?

Both nr_slab_reclaimable and nr_slab_unreclaimable have a very simple
meaning: they are numbers of pages used by corresponding slab caches.

In the answer to the very first version of this patchset
Andrew suggested to generalize the idea to allow further
accounting of non-kmalloc() allocations.
I like the idea, even if don't have a good example right now.

The problem with external names existed for many years before
we've accidentally hit it, so if we don't have other examples
right now, it doesn't mean that we wouldn't have them in the future.

> 
> > We would be just making the reported values more precise wrt reality.
> 
> I was suggesting something similar in an earlier discussion. I am not
> really happy about the new exposed counter either. It is just arbitrary
> by name yet very specific for this particular usecase.
> 
> What is a poor user supposed to do with the new counter? Can this be
> used for any calculations?

For me the most important part is to fix the overcommit logic, because it's
a real security and production issue. Adjusting MemAvailable is important too.

I really open here for any concrete suggestions on how to do it without exporting
of a new value, and without adding too much complexity to the code
(e.g. skipping this particular mm counter on printing will be quite messy).

Thanks!

Michal Hocko April 12, 2018, 2:46 p.m. UTC | #6

On Thu 12-04-18 15:38:33, Roman Gushchin wrote:
> On Thu, Apr 12, 2018 at 01:52:17PM +0200, Michal Hocko wrote:
> > On Thu 12-04-18 08:52:52, Vlastimil Babka wrote:
> > > On 04/11/2018 03:56 PM, Roman Gushchin wrote:
> > > > On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
> > [...]
> > > >> With that in mind, can we at least for now put the (manually maintained)
> > > >> byte counter in a variable that's not directly exposed via /proc/vmstat,
> > > >> and then when printing nr_slab_reclaimable, simply add the value
> > > >> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
> > > >> subtract the same value. This way we would be simply making the existing
> > > >> counters more precise, in line with their semantics.
> > > > 
> > > > Idk, I don't like the idea of adding a counter outside of the vm counters
> > > > infrastructure, and I definitely wouldn't touch the exposed
> > > > nr_slab_reclaimable and nr_slab_unreclaimable fields.
> > 
> > Why?
> 
> Both nr_slab_reclaimable and nr_slab_unreclaimable have a very simple
> meaning: they are numbers of pages used by corresponding slab caches.

Right, but if names are reclaimable then they should end up in the
reclaimable slabs and to be accounted as such. Objects themselves are
not sufficient to reclaim the accounted memory.

> In the answer to the very first version of this patchset
> Andrew suggested to generalize the idea to allow further
> accounting of non-kmalloc() allocations.
> I like the idea, even if don't have a good example right now.

Well, I have to disagree here. It sounds completely ad-hoc without
a reasoable semantic. Or how does it help users when they do not know
what is the indirect dependency and how to trigger it.

> The problem with external names existed for many years before
> we've accidentally hit it, so if we don't have other examples
> right now, it doesn't mean that we wouldn't have them in the future.
> 
> > 
> > > We would be just making the reported values more precise wrt reality.
> > 
> > I was suggesting something similar in an earlier discussion. I am not
> > really happy about the new exposed counter either. It is just arbitrary
> > by name yet very specific for this particular usecase.
> > 
> > What is a poor user supposed to do with the new counter? Can this be
> > used for any calculations?
> 
> For me the most important part is to fix the overcommit logic, because it's
> a real security and production issue.

Sure, the problem is ugly. Not the first one when the unaccounted kernel
allocation can eat a lot of memory. We have many other such. The usual
answer was to use kmemcg accounting.

Roman Gushchin April 12, 2018, 2:57 p.m. UTC | #7

On Thu, Apr 12, 2018 at 08:52:52AM +0200, Vlastimil Babka wrote:
> On 04/11/2018 03:56 PM, Roman Gushchin wrote:
> > On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
> >> [+CC linux-api]
> >>
> >> On 03/05/2018 02:37 PM, Roman Gushchin wrote:
> >>> This patch introduces a concept of indirectly reclaimable memory
> >>> and adds the corresponding memory counter and /proc/vmstat item.
> >>>
> >>> Indirectly reclaimable memory is any sort of memory, used by
> >>> the kernel (except of reclaimable slabs), which is actually
> >>> reclaimable, i.e. will be released under memory pressure.
> >>>
> >>> The counter is in bytes, as it's not always possible to
> >>> count such objects in pages. The name contains BYTES
> >>> by analogy to NR_KERNEL_STACK_KB.
> >>>
> >>> Signed-off-by: Roman Gushchin <guro@fb.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> >>> Cc: Michal Hocko <mhocko@suse.com>
> >>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>> Cc: linux-fsdevel@vger.kernel.org
> >>> Cc: linux-kernel@vger.kernel.org
> >>> Cc: linux-mm@kvack.org
> >>> Cc: kernel-team@fb.com
> >>
> >> Hmm, looks like I'm late and this user-visible API change was just
> >> merged. But it's for rc1, so we can still change it, hopefully?
> >>
> >> One problem I see with the counter is that it's in bytes, but among
> >> counters that use pages, and the name doesn't indicate it.
> > 
> > Here I just followed "nr_kernel_stack" path, which is measured in kB,
> > but this is not mentioned in the field name.
> 
> Oh, didn't know. Bad example to follow :P
> 
> >> Then, I don't
> >> see why users should care about the "indirectly" part, as that's just an
> >> implementation detail. It is reclaimable and that's what matters, right?
> >> (I also wanted to complain about lack of Documentation/... update, but
> >> looks like there's no general file about vmstat, ugh)
> > 
> > I agree, that it's a bit weird, and it's probably better to not expose
> > it at all; but this is how all vm counters work. We do expose them all
> > in /proc/vmstat. A good number of them is useless until you are not a
> > mm developer, so it's arguable more "debug info" rather than "api".
> 
> Yeah the problem is that once tools start rely on them, they fall under
> the "do not break userspace" rule, however we call them. So being
> cautious and conservative can't hurt.
> 
> > It's definitely not a reason to make them messy.
> > Does "nr_indirectly_reclaimable_bytes" look better to you?
> 
> It still has has the "indirecly" part and feels arbitrary :/
> 
> >>
> >> I also kind of liked the idea from v1 rfc posting that there would be a
> >> separate set of reclaimable kmalloc-X caches for these kind of
> >> allocations. Besides accounting, it should also help reduce memory
> >> fragmentation. The right variant of cache would be detected via
> >> __GFP_RECLAIMABLE.
> > 
> > Well, the downside is that we have to introduce X new caches
> > just for this particular problem. I'm not strictly against the idea,
> > but not convinced that it's much better.
> 
> Maybe we can find more cases that would benefit from it. Heck, even slab
> itself allocates some management structures from the generic kmalloc
> caches, and if they are used for reclaimable caches, they could be
> tracked as reclaimable as well.

This is a good catch!

> 
> >>
> >> With that in mind, can we at least for now put the (manually maintained)
> >> byte counter in a variable that's not directly exposed via /proc/vmstat,
> >> and then when printing nr_slab_reclaimable, simply add the value
> >> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
> >> subtract the same value. This way we would be simply making the existing
> >> counters more precise, in line with their semantics.
> > 
> > Idk, I don't like the idea of adding a counter outside of the vm counters
> > infrastructure, and I definitely wouldn't touch the exposed
> > nr_slab_reclaimable and nr_slab_unreclaimable fields.
> 
> We would be just making the reported values more precise wrt reality.

It depends on if we believe that only slab memory can be reclaimable
or not. If yes, this is true, otherwise not.

My guess is that some drivers (e.g. networking) might have buffers,
which are reclaimable under mempressure, and are allocated using
the page allocator. But I have to look closer...

> > We do have some stats in /proc/slabinfo, /proc/meminfo and /sys/kernel/slab
> > and I think that we should keep it consistent.
> 
> Right, meminfo would be adjusted the same. slabinfo doesn't indicate
> which caches are reclaimable, so there will be no change.
> /sys/kernel/slab/cache/reclaim_account does, but I doubt anything will
> break.

It also can be found out of the corresponding directory name in sysfs:
$ ls -la /sys/kernel/slab/dentr*
lrwxrwxrwx. 1 root root 0 Apr 11 14:45 /sys/kernel/slab/dentry -> :aA-0000192
                                                                   ^
						this is the "reclaimable" flag
Not saying that something will break.

Thanks!

Michal Hocko April 13, 2018, 6:59 a.m. UTC | #8

On Thu 12-04-18 15:57:03, Roman Gushchin wrote:
> On Thu, Apr 12, 2018 at 08:52:52AM +0200, Vlastimil Babka wrote:
[...]
> > We would be just making the reported values more precise wrt reality.
> 
> It depends on if we believe that only slab memory can be reclaimable
> or not. If yes, this is true, otherwise not.
> 
> My guess is that some drivers (e.g. networking) might have buffers,
> which are reclaimable under mempressure, and are allocated using
> the page allocator. But I have to look closer...

Well, we have many direct page allocator users which are not accounted
in vmstat. Some of those use their specific accounting (e.g. network
buffers, some fs metadata a many others). In the ideal world MM layer
would know about those but...

Anyway, this particular case is quite clear, no? We _use_ kmalloc so
this is slab allocator. We just misaccount it.

vinayak menon April 13, 2018, 12:13 p.m. UTC | #9

On Thu, Apr 12, 2018 at 8:27 PM, Roman Gushchin <guro@fb.com> wrote:
> On Thu, Apr 12, 2018 at 08:52:52AM +0200, Vlastimil Babka wrote:
>> On 04/11/2018 03:56 PM, Roman Gushchin wrote:
>> > On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
>> >> [+CC linux-api]
>> >>
>> >> On 03/05/2018 02:37 PM, Roman Gushchin wrote:
>> >>> This patch introduces a concept of indirectly reclaimable memory
>> >>> and adds the corresponding memory counter and /proc/vmstat item.
>> >>>
>> >>> Indirectly reclaimable memory is any sort of memory, used by
>> >>> the kernel (except of reclaimable slabs), which is actually
>> >>> reclaimable, i.e. will be released under memory pressure.
>> >>>
>> >>> The counter is in bytes, as it's not always possible to
>> >>> count such objects in pages. The name contains BYTES
>> >>> by analogy to NR_KERNEL_STACK_KB.
>> >>>
>> >>> Signed-off-by: Roman Gushchin <guro@fb.com>
>> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> >>> Cc: Michal Hocko <mhocko@suse.com>
>> >>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> >>> Cc: linux-fsdevel@vger.kernel.org
>> >>> Cc: linux-kernel@vger.kernel.org
>> >>> Cc: linux-mm@kvack.org
>> >>> Cc: kernel-team@fb.com
>> >>
>> >> Hmm, looks like I'm late and this user-visible API change was just
>> >> merged. But it's for rc1, so we can still change it, hopefully?
>> >>
>> >> One problem I see with the counter is that it's in bytes, but among
>> >> counters that use pages, and the name doesn't indicate it.
>> >
>> > Here I just followed "nr_kernel_stack" path, which is measured in kB,
>> > but this is not mentioned in the field name.
>>
>> Oh, didn't know. Bad example to follow :P
>>
>> >> Then, I don't
>> >> see why users should care about the "indirectly" part, as that's just an
>> >> implementation detail. It is reclaimable and that's what matters, right?
>> >> (I also wanted to complain about lack of Documentation/... update, but
>> >> looks like there's no general file about vmstat, ugh)
>> >
>> > I agree, that it's a bit weird, and it's probably better to not expose
>> > it at all; but this is how all vm counters work. We do expose them all
>> > in /proc/vmstat. A good number of them is useless until you are not a
>> > mm developer, so it's arguable more "debug info" rather than "api".
>>
>> Yeah the problem is that once tools start rely on them, they fall under
>> the "do not break userspace" rule, however we call them. So being
>> cautious and conservative can't hurt.
>>
>> > It's definitely not a reason to make them messy.
>> > Does "nr_indirectly_reclaimable_bytes" look better to you?
>>
>> It still has has the "indirecly" part and feels arbitrary :/
>>
>> >>
>> >> I also kind of liked the idea from v1 rfc posting that there would be a
>> >> separate set of reclaimable kmalloc-X caches for these kind of
>> >> allocations. Besides accounting, it should also help reduce memory
>> >> fragmentation. The right variant of cache would be detected via
>> >> __GFP_RECLAIMABLE.
>> >
>> > Well, the downside is that we have to introduce X new caches
>> > just for this particular problem. I'm not strictly against the idea,
>> > but not convinced that it's much better.
>>
>> Maybe we can find more cases that would benefit from it. Heck, even slab
>> itself allocates some management structures from the generic kmalloc
>> caches, and if they are used for reclaimable caches, they could be
>> tracked as reclaimable as well.
>
> This is a good catch!
>
>>
>> >>
>> >> With that in mind, can we at least for now put the (manually maintained)
>> >> byte counter in a variable that's not directly exposed via /proc/vmstat,
>> >> and then when printing nr_slab_reclaimable, simply add the value
>> >> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
>> >> subtract the same value. This way we would be simply making the existing
>> >> counters more precise, in line with their semantics.
>> >
>> > Idk, I don't like the idea of adding a counter outside of the vm counters
>> > infrastructure, and I definitely wouldn't touch the exposed
>> > nr_slab_reclaimable and nr_slab_unreclaimable fields.
>>
>> We would be just making the reported values more precise wrt reality.
>
> It depends on if we believe that only slab memory can be reclaimable
> or not. If yes, this is true, otherwise not.
>
> My guess is that some drivers (e.g. networking) might have buffers,
> which are reclaimable under mempressure, and are allocated using
> the page allocator. But I have to look closer...
>

One such case I have encountered is that of the ION page pool. The page pool
registers a shrinker. When not in any memory pressure page pool can go high
and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.

Thanks,
Vinayak

Vijayanand Jitta April 25, 2018, 3:49 a.m. UTC | #10

On 4/13/2018 5:43 PM, vinayak menon wrote:
> On Thu, Apr 12, 2018 at 8:27 PM, Roman Gushchin <guro@fb.com> wrote:
>> On Thu, Apr 12, 2018 at 08:52:52AM +0200, Vlastimil Babka wrote:
>>> On 04/11/2018 03:56 PM, Roman Gushchin wrote:
>>>> On Wed, Apr 11, 2018 at 03:16:08PM +0200, Vlastimil Babka wrote:
>>>>> [+CC linux-api]
>>>>>
>>>>> On 03/05/2018 02:37 PM, Roman Gushchin wrote:
>>>>>> This patch introduces a concept of indirectly reclaimable memory
>>>>>> and adds the corresponding memory counter and /proc/vmstat item.
>>>>>>
>>>>>> Indirectly reclaimable memory is any sort of memory, used by
>>>>>> the kernel (except of reclaimable slabs), which is actually
>>>>>> reclaimable, i.e. will be released under memory pressure.
>>>>>>
>>>>>> The counter is in bytes, as it's not always possible to
>>>>>> count such objects in pages. The name contains BYTES
>>>>>> by analogy to NR_KERNEL_STACK_KB.
>>>>>>
>>>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>>> Cc: linux-fsdevel@vger.kernel.org
>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>> Cc: linux-mm@kvack.org
>>>>>> Cc: kernel-team@fb.com
>>>>>
>>>>> Hmm, looks like I'm late and this user-visible API change was just
>>>>> merged. But it's for rc1, so we can still change it, hopefully?
>>>>>
>>>>> One problem I see with the counter is that it's in bytes, but among
>>>>> counters that use pages, and the name doesn't indicate it.
>>>>
>>>> Here I just followed "nr_kernel_stack" path, which is measured in kB,
>>>> but this is not mentioned in the field name.
>>>
>>> Oh, didn't know. Bad example to follow :P
>>>
>>>>> Then, I don't
>>>>> see why users should care about the "indirectly" part, as that's just an
>>>>> implementation detail. It is reclaimable and that's what matters, right?
>>>>> (I also wanted to complain about lack of Documentation/... update, but
>>>>> looks like there's no general file about vmstat, ugh)
>>>>
>>>> I agree, that it's a bit weird, and it's probably better to not expose
>>>> it at all; but this is how all vm counters work. We do expose them all
>>>> in /proc/vmstat. A good number of them is useless until you are not a
>>>> mm developer, so it's arguable more "debug info" rather than "api".
>>>
>>> Yeah the problem is that once tools start rely on them, they fall under
>>> the "do not break userspace" rule, however we call them. So being
>>> cautious and conservative can't hurt.
>>>
>>>> It's definitely not a reason to make them messy.
>>>> Does "nr_indirectly_reclaimable_bytes" look better to you?
>>>
>>> It still has has the "indirecly" part and feels arbitrary :/
>>>
>>>>>
>>>>> I also kind of liked the idea from v1 rfc posting that there would be a
>>>>> separate set of reclaimable kmalloc-X caches for these kind of
>>>>> allocations. Besides accounting, it should also help reduce memory
>>>>> fragmentation. The right variant of cache would be detected via
>>>>> __GFP_RECLAIMABLE.
>>>>
>>>> Well, the downside is that we have to introduce X new caches
>>>> just for this particular problem. I'm not strictly against the idea,
>>>> but not convinced that it's much better.
>>>
>>> Maybe we can find more cases that would benefit from it. Heck, even slab
>>> itself allocates some management structures from the generic kmalloc
>>> caches, and if they are used for reclaimable caches, they could be
>>> tracked as reclaimable as well.
>>
>> This is a good catch!
>>
>>>
>>>>>
>>>>> With that in mind, can we at least for now put the (manually maintained)
>>>>> byte counter in a variable that's not directly exposed via /proc/vmstat,
>>>>> and then when printing nr_slab_reclaimable, simply add the value
>>>>> (divided by PAGE_SIZE), and when printing nr_slab_unreclaimable,
>>>>> subtract the same value. This way we would be simply making the existing
>>>>> counters more precise, in line with their semantics.
>>>>
>>>> Idk, I don't like the idea of adding a counter outside of the vm counters
>>>> infrastructure, and I definitely wouldn't touch the exposed
>>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
>>>
>>> We would be just making the reported values more precise wrt reality.
>>
>> It depends on if we believe that only slab memory can be reclaimable
>> or not. If yes, this is true, otherwise not.
>>
>> My guess is that some drivers (e.g. networking) might have buffers,
>> which are reclaimable under mempressure, and are allocated using
>> the page allocator. But I have to look closer...
>>
> 
> One such case I have encountered is that of the ION page pool. The page pool
> registers a shrinker. When not in any memory pressure page pool can go high
> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.
> 
> Thanks,
> Vinayak
> 

As Vinayak mentioned NR_INDIRECTLY_RECLAIMABLE_BYTES can be used to solve the issue
with ION page pool when OVERCOMMIT_GUESS is set, the patch for the same can be 
found here https://lkml.org/lkml/2018/4/24/1288

Roman Gushchin April 25, 2018, 12:52 p.m. UTC | #11

On Wed, Apr 25, 2018 at 09:19:29AM +0530, Vijayanand Jitta wrote:
> >>>> Idk, I don't like the idea of adding a counter outside of the vm counters
> >>>> infrastructure, and I definitely wouldn't touch the exposed
> >>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
> >>>
> >>> We would be just making the reported values more precise wrt reality.
> >>
> >> It depends on if we believe that only slab memory can be reclaimable
> >> or not. If yes, this is true, otherwise not.
> >>
> >> My guess is that some drivers (e.g. networking) might have buffers,
> >> which are reclaimable under mempressure, and are allocated using
> >> the page allocator. But I have to look closer...
> >>
> > 
> > One such case I have encountered is that of the ION page pool. The page pool
> > registers a shrinker. When not in any memory pressure page pool can go high
> > and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
> > a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.

Perfect!
This is exactly what I've expected.

> > 
> > Thanks,
> > Vinayak
> > 
> 
> As Vinayak mentioned NR_INDIRECTLY_RECLAIMABLE_BYTES can be used to solve the issue
> with ION page pool when OVERCOMMIT_GUESS is set, the patch for the same can be 
> found here https://lkml.org/lkml/2018/4/24/1288

This makes perfect sense to me.

Please, fell free to add:
Acked-by: Roman Gushchin <guro@fb.com>

Thank you!

Vlastimil Babka April 25, 2018, 3:47 p.m. UTC | #12

On 04/25/2018 02:52 PM, Roman Gushchin wrote:
> On Wed, Apr 25, 2018 at 09:19:29AM +0530, Vijayanand Jitta wrote:
>>>>>> Idk, I don't like the idea of adding a counter outside of the vm counters
>>>>>> infrastructure, and I definitely wouldn't touch the exposed
>>>>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
>>>>>
>>>>> We would be just making the reported values more precise wrt reality.
>>>>
>>>> It depends on if we believe that only slab memory can be reclaimable
>>>> or not. If yes, this is true, otherwise not.
>>>>
>>>> My guess is that some drivers (e.g. networking) might have buffers,
>>>> which are reclaimable under mempressure, and are allocated using
>>>> the page allocator. But I have to look closer...
>>>>
>>>
>>> One such case I have encountered is that of the ION page pool. The page pool
>>> registers a shrinker. When not in any memory pressure page pool can go high
>>> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
>>> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.

FYI, we have discussed this at LSF/MM and agreed to try the kmalloc
reclaimable caches idea. The existing counter could then remain for page
allocator users such as ION. It's a bit weird to have it in bytes and
not pages then, IMHO. What if we hid it from /proc/vmstat now so it
doesn't become ABI, and later convert it to page granularity and expose
it under a name such as "nr_other_reclaimable" ?

Vlastimil

> Perfect!
> This is exactly what I've expected.
> 
>>>
>>> Thanks,
>>> Vinayak
>>>
>>
>> As Vinayak mentioned NR_INDIRECTLY_RECLAIMABLE_BYTES can be used to solve the issue
>> with ION page pool when OVERCOMMIT_GUESS is set, the patch for the same can be 
>> found here https://lkml.org/lkml/2018/4/24/1288
> 
> This makes perfect sense to me.
> 
> Please, fell free to add:
> Acked-by: Roman Gushchin <guro@fb.com>
> 
> Thank you!
>

Matthew Wilcox April 25, 2018, 3:55 p.m. UTC | #13

On Fri, Apr 13, 2018 at 05:43:39PM +0530, vinayak menon wrote:
> One such case I have encountered is that of the ION page pool. The page pool
> registers a shrinker. When not in any memory pressure page pool can go high
> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.

Why not just account them as NR_SLAB_RECLAIMABLE?  I know it's not slab, but
other than that mis-naming, it seems like it'll do the right thing.

Roman Gushchin April 25, 2018, 4:48 p.m. UTC | #14

On Wed, Apr 25, 2018 at 05:47:26PM +0200, Vlastimil Babka wrote:
> On 04/25/2018 02:52 PM, Roman Gushchin wrote:
> > On Wed, Apr 25, 2018 at 09:19:29AM +0530, Vijayanand Jitta wrote:
> >>>>>> Idk, I don't like the idea of adding a counter outside of the vm counters
> >>>>>> infrastructure, and I definitely wouldn't touch the exposed
> >>>>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
> >>>>>
> >>>>> We would be just making the reported values more precise wrt reality.
> >>>>
> >>>> It depends on if we believe that only slab memory can be reclaimable
> >>>> or not. If yes, this is true, otherwise not.
> >>>>
> >>>> My guess is that some drivers (e.g. networking) might have buffers,
> >>>> which are reclaimable under mempressure, and are allocated using
> >>>> the page allocator. But I have to look closer...
> >>>>
> >>>
> >>> One such case I have encountered is that of the ION page pool. The page pool
> >>> registers a shrinker. When not in any memory pressure page pool can go high
> >>> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
> >>> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.
> 
> FYI, we have discussed this at LSF/MM and agreed to try the kmalloc
> reclaimable caches idea. The existing counter could then remain for page
> allocator users such as ION. It's a bit weird to have it in bytes and
> not pages then, IMHO. What if we hid it from /proc/vmstat now so it
> doesn't become ABI, and later convert it to page granularity and expose
> it under a name such as "nr_other_reclaimable" ?

I've nothing against hiding it from /proc/vmstat, as long as we keep
the counter in place and the main issue resolved.

Maybe it's better to add nr_reclaimable = nr_slab_reclaimable + nr_other_reclaimable,
which will have a simpler meaning that nr_other_reclaimable (what is other?).

Thanks!

Vlastimil Babka April 25, 2018, 4:59 p.m. UTC | #15

On 04/25/2018 05:55 PM, Matthew Wilcox wrote:
> On Fri, Apr 13, 2018 at 05:43:39PM +0530, vinayak menon wrote:
>> One such case I have encountered is that of the ION page pool. The page pool
>> registers a shrinker. When not in any memory pressure page pool can go high
>> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
>> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.
> 
> Why not just account them as NR_SLAB_RECLAIMABLE?  I know it's not slab, but
> other than that mis-naming, it seems like it'll do the right thing.

Hm I think it would be confusing for anyone trying to correlate the
number with /proc/slabinfo - the numbers there wouldn't add up.

Vlastimil Babka April 25, 2018, 5:02 p.m. UTC | #16

On 04/25/2018 06:48 PM, Roman Gushchin wrote:
> On Wed, Apr 25, 2018 at 05:47:26PM +0200, Vlastimil Babka wrote:
>> On 04/25/2018 02:52 PM, Roman Gushchin wrote:
>>> On Wed, Apr 25, 2018 at 09:19:29AM +0530, Vijayanand Jitta wrote:
>>>>>>>> Idk, I don't like the idea of adding a counter outside of the vm counters
>>>>>>>> infrastructure, and I definitely wouldn't touch the exposed
>>>>>>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
>>>>>>>
>>>>>>> We would be just making the reported values more precise wrt reality.
>>>>>>
>>>>>> It depends on if we believe that only slab memory can be reclaimable
>>>>>> or not. If yes, this is true, otherwise not.
>>>>>>
>>>>>> My guess is that some drivers (e.g. networking) might have buffers,
>>>>>> which are reclaimable under mempressure, and are allocated using
>>>>>> the page allocator. But I have to look closer...
>>>>>>
>>>>>
>>>>> One such case I have encountered is that of the ION page pool. The page pool
>>>>> registers a shrinker. When not in any memory pressure page pool can go high
>>>>> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
>>>>> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.
>>
>> FYI, we have discussed this at LSF/MM and agreed to try the kmalloc
>> reclaimable caches idea. The existing counter could then remain for page
>> allocator users such as ION. It's a bit weird to have it in bytes and
>> not pages then, IMHO. What if we hid it from /proc/vmstat now so it
>> doesn't become ABI, and later convert it to page granularity and expose
>> it under a name such as "nr_other_reclaimable" ?
> 
> I've nothing against hiding it from /proc/vmstat, as long as we keep
> the counter in place and the main issue resolved.

Sure.

> Maybe it's better to add nr_reclaimable = nr_slab_reclaimable + nr_other_reclaimable,
> which will have a simpler meaning that nr_other_reclaimable (what is other?).

"other" can be changed, sure. nr_reclaimable is possible if we change
slab to adjust that counter as well - vmstat code doesn't support
arbitrary calculations when printing.

> Thanks!
>

Roman Gushchin April 25, 2018, 5:23 p.m. UTC | #17

On Wed, Apr 25, 2018 at 07:02:42PM +0200, Vlastimil Babka wrote:
> On 04/25/2018 06:48 PM, Roman Gushchin wrote:
> > On Wed, Apr 25, 2018 at 05:47:26PM +0200, Vlastimil Babka wrote:
> >> On 04/25/2018 02:52 PM, Roman Gushchin wrote:
> >>> On Wed, Apr 25, 2018 at 09:19:29AM +0530, Vijayanand Jitta wrote:
> >>>>>>>> Idk, I don't like the idea of adding a counter outside of the vm counters
> >>>>>>>> infrastructure, and I definitely wouldn't touch the exposed
> >>>>>>>> nr_slab_reclaimable and nr_slab_unreclaimable fields.
> >>>>>>>
> >>>>>>> We would be just making the reported values more precise wrt reality.
> >>>>>>
> >>>>>> It depends on if we believe that only slab memory can be reclaimable
> >>>>>> or not. If yes, this is true, otherwise not.
> >>>>>>
> >>>>>> My guess is that some drivers (e.g. networking) might have buffers,
> >>>>>> which are reclaimable under mempressure, and are allocated using
> >>>>>> the page allocator. But I have to look closer...
> >>>>>>
> >>>>>
> >>>>> One such case I have encountered is that of the ION page pool. The page pool
> >>>>> registers a shrinker. When not in any memory pressure page pool can go high
> >>>>> and thus cause an mmap to fail when OVERCOMMIT_GUESS is set. I can send
> >>>>> a patch to account ION page pool pages in NR_INDIRECTLY_RECLAIMABLE_BYTES.
> >>
> >> FYI, we have discussed this at LSF/MM and agreed to try the kmalloc
> >> reclaimable caches idea. The existing counter could then remain for page
> >> allocator users such as ION. It's a bit weird to have it in bytes and
> >> not pages then, IMHO. What if we hid it from /proc/vmstat now so it
> >> doesn't become ABI, and later convert it to page granularity and expose
> >> it under a name such as "nr_other_reclaimable" ?
> > 
> > I've nothing against hiding it from /proc/vmstat, as long as we keep
> > the counter in place and the main issue resolved.
> 
> Sure.
> 
> > Maybe it's better to add nr_reclaimable = nr_slab_reclaimable + nr_other_reclaimable,
> > which will have a simpler meaning that nr_other_reclaimable (what is other?).
> 
> "other" can be changed, sure. nr_reclaimable is possible if we change
> slab to adjust that counter as well - vmstat code doesn't support
> arbitrary calculations when printing.

Sure, but even just hiding a value isn't that easy now.
So we have to touch the vmstat printing code anyway.

Thanks!

[1/3] mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES

Commit Message

Comments

Patch