
proc: Avoid costly high-order page allocations when reading proc files

Message ID 20250401073046.51121-1-laoar.shao@gmail.com (mailing list archive)
State New

Commit Message

Yafang Shao April 1, 2025, 7:30 a.m. UTC
While investigating a kcompactd 100% CPU utilization issue in production, I
observed frequent costly high-order (order-6) page allocations triggered by
proc file reads from monitoring tools. This can be reproduced with a simple
test case:

  fd = open(PROC_FILE, O_RDONLY);
  size = read(fd, buff, 256KB);
  close(fd);

Although we should modify the monitoring tools to use smaller buffer sizes,
we should also enhance the kernel to prevent these expensive high-order
allocations.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
---
 fs/proc/proc_sysctl.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
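
To make the "order-6" figure above concrete, here is a minimal userspace sketch of the order arithmetic. It assumes 4 KiB pages and a costly order of 3 (the mainline value of PAGE_ALLOC_COSTLY_ORDER). The names model_get_order() and model_is_costly() are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions: 4 KiB pages; costly order 3 as in mainline. */
#define MODEL_PAGE_SIZE     4096UL
#define MODEL_COSTLY_ORDER  3U

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= size. */
static unsigned int model_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((MODEL_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* A request is "costly" when its order exceeds the costly order. */
static int model_is_costly(size_t size)
{
	return model_get_order(size) > MODEL_COSTLY_ORDER;
}
```

Under these assumptions a 256 KiB buffer maps to order 6, and any request larger than 32 KiB (page size shifted left by 3) counts as costly.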

Comments

Kees Cook April 1, 2025, 2:01 p.m. UTC | #1
On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>While investigating a kcompactd 100% CPU utilization issue in production, I
>observed frequent costly high-order (order-6) page allocations triggered by
>proc file reads from monitoring tools. This can be reproduced with a simple
>test case:
>
>  fd = open(PROC_FILE, O_RDONLY);
>  size = read(fd, buff, 256KB);
>  close(fd);
>
>Although we should modify the monitoring tools to use smaller buffer sizes,
>we should also enhance the kernel to prevent these expensive high-order
>allocations.
>
>Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>Cc: Josef Bacik <josef@toxicpanda.com>
>---
> fs/proc/proc_sysctl.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index cc9d74a06ff0..c53ba733bda5 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> 	error = -ENOMEM;
> 	if (count >= KMALLOC_MAX_SIZE)
> 		goto out;
>-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
>+
>+	/*
>+	 * Use vmalloc if the count is too large to avoid costly high-order page
>+	 * allocations.
>+	 */
>+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>+		kbuf = kvzalloc(count + 1, GFP_KERNEL);

Why not move this check into kvmalloc family?

>+	else
>+		kbuf = vmalloc(count + 1);

You dropped the zeroing. This must be vzalloc.

> 	if (!kbuf)
> 		goto out;
> 

Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?

-Kees
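
The zeroing point above can be illustrated with a small userspace model of the hunk. This is only a sketch: calloc() stands in for both kvzalloc() and vzalloc(), the page size (4 KiB) and costly order (3) are assumptions, and sketch_pick_path() and sketch_alloc_kbuf() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE     4096UL
#define SKETCH_COSTLY_ORDER  3U	/* assumed PAGE_ALLOC_COSTLY_ORDER */

enum kbuf_path { KBUF_KVZALLOC, KBUF_VZALLOC };

/* Same comparison as the posted hunk: small counts keep the kvzalloc() path. */
static enum kbuf_path sketch_pick_path(size_t count)
{
	if (count < (SKETCH_PAGE_SIZE << SKETCH_COSTLY_ORDER))
		return KBUF_KVZALLOC;
	return KBUF_VZALLOC;
}

/*
 * Userspace stand-in: calloc() plays the role of both zeroing allocators,
 * so the buffer is zero-filled whichever branch is taken.
 */
static char *sketch_alloc_kbuf(size_t count, enum kbuf_path *path)
{
	assert(path != NULL);
	*path = sketch_pick_path(count);
	return calloc(count + 1, 1);
}
```

A caller would pass the requested count, check the returned pointer for NULL, and release the buffer with free(). In this model both branches hand back zeroed memory, which is the property the review comment asks the patch to preserve.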
Yafang Shao April 1, 2025, 2:50 p.m. UTC | #2
On Tue, Apr 1, 2025 at 10:01 PM Kees Cook <kees@kernel.org> wrote:
>
>
>
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >       error = -ENOMEM;
> >       if (count >= KMALLOC_MAX_SIZE)
> >               goto out;
> >-      kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+      /*
> >+       * Use vmalloc if the count is too large to avoid costly high-order page
> >+       * allocations.
> >+       */
> >+      if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+              kbuf = kvzalloc(count + 1, GFP_KERNEL);
>
> Why not move this check into kvmalloc family?

Good suggestion.

>
> >+      else
> >+              kbuf = vmalloc(count + 1);
>
> You dropped the zeroing. This must be vzalloc.

Nice catch.

>
> >       if (!kbuf)
> >               goto out;
> >
>
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?

This would break backward compatibility with existing tools, so we
cannot enforce this restriction.
Harry Yoo April 2, 2025, 4:15 a.m. UTC | #3
On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> 
> 
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > 	error = -ENOMEM;
> > 	if (count >= KMALLOC_MAX_SIZE)
> > 		goto out;
> >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+	/*
> >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> >+	 * allocations.
> >+	 */
> >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> 
> Why not move this check into kvmalloc family?

Hmm should this check really be in kvmalloc family?

I don't think users would expect kvmalloc() to implicitly decide on using
vmalloc() without trying kmalloc() first, just because it's a high-order
allocation.

> >+	else
> >+		kbuf = vmalloc(count + 1);
> 
> You dropped the zeroing. This must be vzalloc.
> 
> > 	if (!kbuf)
> > 		goto out;
> > 
> 
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?
> 
> -Kees
> 
> -- 
> Kees Cook
Yafang Shao April 2, 2025, 8:42 a.m. UTC | #4
On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >
> >
> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > >observed frequent costly high-order (order-6) page allocations triggered by
> > >proc file reads from monitoring tools. This can be reproduced with a simple
> > >test case:
> > >
> > >  fd = open(PROC_FILE, O_RDONLY);
> > >  size = read(fd, buff, 256KB);
> > >  close(fd);
> > >
> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > >we should also enhance the kernel to prevent these expensive high-order
> > >allocations.
> > >
> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > >Cc: Josef Bacik <josef@toxicpanda.com>
> > >---
> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > >index cc9d74a06ff0..c53ba733bda5 100644
> > >--- a/fs/proc/proc_sysctl.c
> > >+++ b/fs/proc/proc_sysctl.c
> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > >     error = -ENOMEM;
> > >     if (count >= KMALLOC_MAX_SIZE)
> > >             goto out;
> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >+
> > >+    /*
> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > >+     * allocations.
> > >+     */
> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >
> > Why not move this check into kvmalloc family?
>
> Hmm should this check really be in kvmalloc family?

Modifying the existing kvmalloc functions risks performance regressions.
Could we instead introduce a new variant like vkmalloc() (favoring
vmalloc over kmalloc) or kvmalloc_costless()?

>
> I don't think users would expect kvmalloc() to implicitly decide on using
> vmalloc() without trying kmalloc() first, just because it's a high-order
> allocation.
>
Vlastimil Babka April 2, 2025, 9:25 a.m. UTC | #5
On 4/2/25 10:42, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
>> >
>> >
>> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>> > >While investigating a kcompactd 100% CPU utilization issue in production, I
>> > >observed frequent costly high-order (order-6) page allocations triggered by
>> > >proc file reads from monitoring tools. This can be reproduced with a simple
>> > >test case:
>> > >
>> > >  fd = open(PROC_FILE, O_RDONLY);
>> > >  size = read(fd, buff, 256KB);
>> > >  close(fd);
>> > >
>> > >Although we should modify the monitoring tools to use smaller buffer sizes,
>> > >we should also enhance the kernel to prevent these expensive high-order
>> > >allocations.
>> > >
>> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>> > >Cc: Josef Bacik <josef@toxicpanda.com>
>> > >---
>> > > fs/proc/proc_sysctl.c | 10 +++++++++-
>> > > 1 file changed, 9 insertions(+), 1 deletion(-)
>> > >
>> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>> > >index cc9d74a06ff0..c53ba733bda5 100644
>> > >--- a/fs/proc/proc_sysctl.c
>> > >+++ b/fs/proc/proc_sysctl.c
>> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>> > >     error = -ENOMEM;
>> > >     if (count >= KMALLOC_MAX_SIZE)
>> > >             goto out;
>> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> > >+
>> > >+    /*
>> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
>> > >+     * allocations.
>> > >+     */
>> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> >
>> > Why not move this check into kvmalloc family?
>>
>> Hmm should this check really be in kvmalloc family?
> 
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
kmalloc() is before the vmalloc() fallback. It does e.g.:

                if (!(flags & __GFP_RETRY_MAYFAIL))
                        flags |= __GFP_NORETRY;

However if your problem is kcompactd utilization then the kmalloc() attempt
would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
deferred compaction mechanism would limit the issue in the first place.

The ad-hoc fixing up of a particular place (/proc files reading) or creating
a new vkmalloc() and then spreading its use as you see other places
triggering the issue seems quite suboptimal to me.

>>
>> I don't think users would expect kvmalloc() to implicitly decide on using
>> vmalloc() without trying kmalloc() first, just because it's a high-order
>> allocation.
>>
>
Dave Chinner April 2, 2025, 11:32 a.m. UTC | #6
On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > >
> > >
> > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > >test case:
> > > >
> > > >  fd = open(PROC_FILE, O_RDONLY);
> > > >  size = read(fd, buff, 256KB);
> > > >  close(fd);
> > > >
> > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > >we should also enhance the kernel to prevent these expensive high-order
> > > >allocations.
> > > >
> > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > >---
> > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > >--- a/fs/proc/proc_sysctl.c
> > > >+++ b/fs/proc/proc_sysctl.c
> > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > >     error = -ENOMEM;
> > > >     if (count >= KMALLOC_MAX_SIZE)
> > > >             goto out;
> > > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >+
> > > >+    /*
> > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > >+     * allocations.
> > > >+     */
> > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >
> > > Why not move this check into kvmalloc family?
> >
> > Hmm should this check really be in kvmalloc family?
> 
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We should fix kvmalloc() instead of continuing to force
subsystems to work around the limitations of kvmalloc().

Have a look at xlog_kvmalloc() in XFS. It implements a basic
fast-fail, no retry high order kmalloc before it falls back to
vmalloc by turning off direct reclaim for the kmalloc() call.
Hence if there isn't a high-order page on the free lists ready
to allocate, it falls back to vmalloc() immediately.

For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
overhead by around 80% when compared to a standard kvmalloc()
call. Numbers and profiles were documented in the commit message
(reproduced in whole below)...

> > I don't think users would expect kvmalloc() to implicitly decide on using
> > vmalloc() without trying kmalloc() first, just because it's a high-order
> > allocation.

Right, but users expect kvmalloc() to use the most efficient
allocation paths available to it.

In this case, vmalloc() is faster and more reliable than
direct reclaim w/ compaction. Hence vmalloc() should really be the
primary fallback path when high-order pages are not immediately
available to kmalloc() when called from kvmalloc()...

-Dave.
Michal Hocko April 2, 2025, 12:17 p.m. UTC | #7
On Wed 02-04-25 11:25:12, Vlastimil Babka wrote:
> On 4/2/25 10:42, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >> >
> >> >
> >> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> >> > >observed frequent costly high-order (order-6) page allocations triggered by
> >> > >proc file reads from monitoring tools. This can be reproduced with a simple
> >> > >test case:
> >> > >
> >> > >  fd = open(PROC_FILE, O_RDONLY);
> >> > >  size = read(fd, buff, 256KB);
> >> > >  close(fd);
> >> > >
> >> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> >> > >we should also enhance the kernel to prevent these expensive high-order
> >> > >allocations.
> >> > >
> >> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> > >Cc: Josef Bacik <josef@toxicpanda.com>
> >> > >---
> >> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> >> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> > >index cc9d74a06ff0..c53ba733bda5 100644
> >> > >--- a/fs/proc/proc_sysctl.c
> >> > >+++ b/fs/proc/proc_sysctl.c
> >> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >> > >     error = -ENOMEM;
> >> > >     if (count >= KMALLOC_MAX_SIZE)
> >> > >             goto out;
> >> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> > >+
> >> > >+    /*
> >> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> >> > >+     * allocations.
> >> > >+     */
> >> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> >
> >> > Why not move this check into kvmalloc family?
> >>
> >> Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
> kmalloc() is before the vmalloc() fallback. It does e.g.:
> 
>                 if (!(flags & __GFP_RETRY_MAYFAIL))
>                         flags |= __GFP_NORETRY;
> 
> However if your problem is kcompactd utilization then the kmalloc() attempt
> would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
> kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
> deferred compaction mechanism would limit the issue in the first place.

Yes, triggering heavy compaction for costly allocations seems to be quite
bad. We have __GFP_RETRY_MAYFAIL for that purpose if the caller really
needs the allocation to try hard.

> The ad-hoc fixing up of a particular place (/proc files reading) or creating
> a new vkmalloc() and then spreading its use as you see other places
> triggering the issue seems quite suboptimal to me.

Yes I absolutely agree.
Michal Hocko April 2, 2025, 12:24 p.m. UTC | #8
On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > > >test case:
> > > > >
> > > > >  fd = open(PROC_FILE, O_RDONLY);
> > > > >  size = read(fd, buff, 256KB);
> > > > >  close(fd);
> > > > >
> > > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > > >we should also enhance the kernel to prevent these expensive high-order
> > > > >allocations.
> > > > >
> > > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > > >---
> > > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > > >--- a/fs/proc/proc_sysctl.c
> > > > >+++ b/fs/proc/proc_sysctl.c
> > > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > > >     error = -ENOMEM;
> > > > >     if (count >= KMALLOC_MAX_SIZE)
> > > > >             goto out;
> > > > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >+
> > > > >+    /*
> > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > >+     * allocations.
> > > > >+     */
> > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >
> > > > Why not move this check into kvmalloc family?
> > >
> > > Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We should fix kvmalloc() instead of continuing to force
> subsystems to work around the limitations of kvmalloc().

Agreed!

> Have a look at xlog_kvmalloc() in XFS. It implements a basic
> fast-fail, no retry high order kmalloc before it falls back to
> vmalloc by turning off direct reclaim for the kmalloc() call.
> Hence if there isn't a high-order page on the free lists ready
> to allocate, it falls back to vmalloc() immediately.
> 
> For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> overhead by around 80% when compared to a standard kvmalloc()
> call. Numbers and profiles were documented in the commit message
> (reproduced in whole below)...

Btw. it would be really great to have such concerns posted to the
linux-mm ML so that we are aware of them.

kvmalloc currently doesn't support GFP_NOWAIT semantics, but it does allow
callers to express "I prefer the slab allocator over vmalloc". I think we
could make the kvmalloc slab path weaker by default, as those who really
want slab already have the means to achieve that. There is a risk of
long-term fragmentation, but I think this is worth trying:

diff --git a/mm/util.c b/mm/util.c
index 60aa40f612b8..8386f6976d7d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
 	 * contribute to a long term fragmentation less than vmalloc fallback.
-	 * However make sure that larger requests are not too disruptive - no
-	 * OOM killer and no allocation failure warnings as we have a fallback.
+	 * However make sure that larger requests are not too disruptive - i.e.
+	 * do not direct reclaim unless physically continuous memory is preferred
+	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+	 * working in the background but the allocation itself.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
 		if (!(flags & __GFP_RETRY_MAYFAIL))
 			flags |= __GFP_NORETRY;
+		else
+			flags &= ~__GFP_DIRECT_RECLAIM;
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
Matthew Wilcox (Oracle) April 2, 2025, 5:24 p.m. UTC | #9
On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > >+    /*
> > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > >+     * allocations.
> > > > > >+     */
> > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >
> > > > > Why not move this check into kvmalloc family?
> > > >
> > > > Hmm should this check really be in kvmalloc family?
> > > 
> > > Modifying the existing kvmalloc functions risks performance regressions.
> > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > vmalloc over kmalloc) or kvmalloc_costless()?
> > 
> > We should fix kvmalloc() instead of continuing to force
> > subsystems to work around the limitations of kvmalloc().
> 
> Agreed!
> 
> > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > fast-fail, no retry high order kmalloc before it falls back to
> > vmalloc by turning off direct reclaim for the kmalloc() call.
> > Hence if there isn't a high-order page on the free lists ready
> > to allocate, it falls back to vmalloc() immediately.

... but if vmalloc fails, it goes around again!  This is exactly why
we don't want filesystems implementing workarounds for MM problems.
What a mess.

>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
>  			flags |= __GFP_NORETRY;
> +		else
> +			flags &= ~__GFP_DIRECT_RECLAIM;

I think it might be better to do this:

		flags |= __GFP_NOWARN;

		if (!(flags & __GFP_RETRY_MAYFAIL))
			flags |= __GFP_NORETRY;
+		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+			flags &= ~__GFP_DIRECT_RECLAIM;

I think it's entirely appropriate for a call to kvmalloc() to do
direct reclaim if it's asking for, say, 16KiB and we don't have any of
those available.  Better than exacerbating the fragmentation problem by
allocating 4x4KiB pages, each from different groupings.
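
The difference between the two hunks above can be shown with a small userspace model. This is a sketch only: the bit values are made up and are NOT the kernel's gfp encodings, the page size (4 KiB) and costly order (3) are assumptions, and model_gfp_adjust() is a hypothetical name. It mirrors the hunks literally as posted, including which branch clears direct reclaim, and assumes the outer size check also applies to the second variant.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; not the kernel's gfp encodings. */
#define M_DIRECT_RECLAIM  0x01u
#define M_KSWAPD_RECLAIM  0x02u
#define M_NOWARN          0x04u
#define M_NORETRY         0x08u
#define M_RETRY_MAYFAIL   0x10u
#define M_NOFAIL          0x20u

#define M_PAGE_SIZE       4096UL
#define M_COSTLY_ORDER    3U	/* assumed PAGE_ALLOC_COSTLY_ORDER */

/*
 * costly_only == 0: first hunk, the else branch always clears direct reclaim.
 * costly_only != 0: second hunk, the else branch applies only to sizes above
 *                   the costly threshold.
 */
static unsigned int model_gfp_adjust(unsigned int flags, size_t size,
				     int costly_only)
{
	assert(size > 0);
	if (size > M_PAGE_SIZE) {
		flags |= M_NOWARN;

		if (!(flags & M_RETRY_MAYFAIL))
			flags |= M_NORETRY;
		else if (!costly_only ||
			 size > (M_PAGE_SIZE << M_COSTLY_ORDER))
			flags &= ~M_DIRECT_RECLAIM;

		/* nofail semantics are left to the vmalloc fallback */
		flags &= ~M_NOFAIL;
	}
	return flags;
}
```

In this model a 16 KiB request that reaches the else branch keeps direct reclaim under the second variant and loses it under the first, while a 64 KiB request loses it under both.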
Shakeel Butt April 2, 2025, 6:25 p.m. UTC | #10
On Wed, Apr 02, 2025 at 11:25:12AM +0200, Vlastimil Babka wrote:
> On 4/2/25 10:42, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >> >
> >> >
> >> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> >> > >observed frequent costly high-order (order-6) page allocations triggered by
> >> > >proc file reads from monitoring tools. This can be reproduced with a simple
> >> > >test case:
> >> > >
> >> > >  fd = open(PROC_FILE, O_RDONLY);
> >> > >  size = read(fd, buff, 256KB);
> >> > >  close(fd);
> >> > >
> >> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> >> > >we should also enhance the kernel to prevent these expensive high-order
> >> > >allocations.
> >> > >
> >> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> > >Cc: Josef Bacik <josef@toxicpanda.com>
> >> > >---
> >> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> >> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> > >index cc9d74a06ff0..c53ba733bda5 100644
> >> > >--- a/fs/proc/proc_sysctl.c
> >> > >+++ b/fs/proc/proc_sysctl.c
> >> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >> > >     error = -ENOMEM;
> >> > >     if (count >= KMALLOC_MAX_SIZE)
> >> > >             goto out;
> >> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> > >+
> >> > >+    /*
> >> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> >> > >+     * allocations.
> >> > >+     */
> >> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> >
> >> > Why not move this check into kvmalloc family?
> >>
> >> Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
> kmalloc() is before the vmalloc() fallback. It does e.g.:
> 
>                 if (!(flags & __GFP_RETRY_MAYFAIL))
>                         flags |= __GFP_NORETRY;
> 
> However if your problem is kcompactd utilization then the kmalloc() attempt
> would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
> kcompactd. Should we remove the flag for costly orders? Dunno.

I agree with the points that follow (i.e. about ad-hoc fixing etc.). The
above point about removing kswapd reclaim for costly orders needs more
thought. Will we be hiding some compaction issues by doing so (i.e. no
kswapd reclaim for costly orders)?

> Ideally the
> deferred compaction mechanism would limit the issue in the first place.
> 
> The ad-hoc fixing up of a particular place (/proc files reading) or creating
> a new vkmalloc() and then spreading its use as you see other places
> triggering the issue seems quite suboptimal to me.
> 
> >>
> >> I don't think users would expect kvmalloc() to implicitly decide on using
> >> vmalloc() without trying kmalloc() first, just because it's a high-order
> >> allocation.
> >>
> > 
>
Shakeel Butt April 2, 2025, 6:30 p.m. UTC | #11
On Wed, Apr 02, 2025 at 06:24:10PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > >+    /*
> > > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > > >+     * allocations.
> > > > > > >+     */
> > > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > > >
> > > > > > Why not move this check into kvmalloc family?
> > > > >
> > > > > Hmm should this check really be in kvmalloc family?
> > > > 
> > > > Modifying the existing kvmalloc functions risks performance regressions.
> > > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > > vmalloc over kmalloc) or kvmalloc_costless()?
> > > 
> > > We should fix kvmalloc() instead of continuing to force
> > > subsystems to work around the limitations of kvmalloc().
> > 
> > Agreed!
> > 
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > Hence if there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> 
> ... but if vmalloc fails, it goes around again!  This is exactly why
> we don't want filesystems implementing workarounds for MM problems.
> What a mess.
> 
> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;
> > +		else
> > +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think it might be better to do this:
> 
> 		flags |= __GFP_NOWARN;
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags |= __GFP_NORETRY;
> +		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +			flags &= ~__GFP_DIRECT_RECLAIM;

The above seems more appropriate than Michal's bigger hammer.
In addition, I think Vlastimil has a very good point about the kswapd
reclaim for such cases (the patch explicitly complains about kcompactd
CPU usage).

> 
> I think it's entirely appropriate for a call to kvmalloc() to do
> direct reclaim if it's asking for, say, 16KiB and we don't have any of
> those available.  Better than exacerbating the fragmentation problem by
> allocating 4x4KiB pages, each from different groupings.
Dave Chinner April 2, 2025, 9:16 p.m. UTC | #12
On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > fast-fail, no retry high order kmalloc before it falls back to
> > vmalloc by turning off direct reclaim for the kmalloc() call.
> > Hence if there isn't a high-order page on the free lists ready
> > to allocate, it falls back to vmalloc() immediately.
> > 
> > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > overhead by around 80% when compared to a standard kvmalloc()
> > call. Numbers and profiles were documented in the commit message
> > (reproduced in whole below)...
> 
> Btw. it would be really great to have such concerns posted to the
> linux-mm ML so that we are aware of them.

I have brought it up in the past, along with all the other kvmalloc
API problems that are mentioned in that commit message.
Unfortunately, discussion focus always ended up on calling context
and API flags (e.g. whether stuff like GFP_NOFS should be supported
or not), not the fast-fail-then-no-fail behaviour we need.

Yes, these discussions have resulted in API changes that support
some new subset of gfp flags, but the performance issues have never
been addressed...

> kvmalloc currently doesn't support GFP_NOWAIT semantics, but it does allow
> callers to express "I prefer the SLAB allocator over vmalloc".

The conditional use of __GFP_NORETRY for the kmalloc call is broken
if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
mask to hold __GFP_NOFAIL | __GFP_NORETRY....

We have a hard requirement for xlog_kvmalloc() to provide
__GFP_NOFAIL semantics.

IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
performance with fallback to vmalloc(__GFP_NOFAIL) for
correctness...
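
The fast-fail-then-no-fail behaviour described above can be modelled in userspace. The following is only a sketch under stated assumptions, not kernel code and not the xlog_kvmalloc() implementation: every `demo_*` name is hypothetical, the fast path is a callback that gets exactly one attempt, and the fallback is retried in a loop as a stand-in for __GFP_NOFAIL semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator callback: returns NULL on failure. */
typedef void *(*demo_alloc_fn)(size_t size);

struct demo_stats {
	unsigned int fast_hits;    /* requests satisfied by the fast path */
	unsigned int fallbacks;    /* fast path failed, fallback was used */
	unsigned int slow_retries; /* failed fallback attempts before success */
};

/*
 * Try the cheap allocator exactly once, then fall back to an allocator
 * that is retried until it succeeds. The loop models no-fail semantics;
 * it is an assumption of this sketch, not how the kernel implements it.
 */
static void *demo_alloc_fast_then_nofail(size_t size, demo_alloc_fn fast,
					 demo_alloc_fn slow,
					 struct demo_stats *st)
{
	void *p;

	assert(size > 0 && fast && slow && st);

	p = fast(size);		/* single attempt: no reclaim, no retry */
	if (p) {
		st->fast_hits++;
		return p;
	}

	st->fallbacks++;
	while (!(p = slow(size)))	/* keep going until it succeeds */
		st->slow_retries++;
	return p;
}

/* Demo stub: a fast path with no suitable free block available. */
static void *demo_fast_unavailable(size_t size)
{
	(void)size;
	return NULL;
}

/* Demo stub: a fallback that fails twice before succeeding. */
static void *demo_slow_fails_twice(size_t size)
{
	static int failures_left = 2;

	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}
```

Calling `demo_alloc_fast_then_nofail(64, demo_fast_unavailable, demo_slow_fails_twice, &st)` with a zeroed `st` records one fallback and two slow retries, while passing `malloc` as the fast path records a fast hit and never touches the fallback.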

> I think we could make
> the default kvmalloc slab path weaker by default as those who really
> want slab already have means to achieve that. There is a risk of long
> term fragmentation but I think this is worth trying

We've been doing this for a few years now in XFS in a hot path that
can make in the order of a million xlog_kvmalloc() calls a second.
We've not seen any evidence that this causes or exacerbates memory
fragmentation....

-Dave.
Dave Chinner April 2, 2025, 10:38 p.m. UTC | #13
On Wed, Apr 02, 2025 at 06:24:10PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > >+    /*
> > > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > > >+     * allocations.
> > > > > > >+     */
> > > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > > >
> > > > > > Why not move this check into kvmalloc family?
> > > > >
> > > > > Hmm should this check really be in kvmalloc family?
> > > > 
> > > > Modifying the existing kvmalloc functions risks performance regressions.
> > > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > > vmalloc over kmalloc) or kvmalloc_costless()?
> > > 
> > > We should fix kvmalloc() instead of continuing to force
> > > subsystems to work around the limitations of kvmalloc().
> > 
> > Agreed!
> > 
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > Hence if there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> 
> ... but if vmalloc fails, it goes around again!  This is exactly why
> we don't want filesystems implementing workarounds for MM problems.
> What a mess.

That's because we need __GFP_NOFAIL semantics for the overall
operation, and we can't pass that to kvmalloc() because it doesn't
support __GFP_NOFAIL. And when this code was written, vmalloc didn't
support __GFP_NOFAIL, either. We *had* to open code nofail
semantics, because the mm infrastructure did not provide it.

Yes, we can fix this now that __vmalloc(__GFP_NOFAIL) is a thing.
We still need to open code the kmalloc() side of the operation right
now because....

> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;

.... this is a built-in catch-22.

If we use kvmalloc(__GFP_NOFAIL), this code results in kmalloc
with __GFP_NORETRY | __GFP_NOFAIL flags set. i.e. we are telling
the allocation that it must not retry but it also must retry until
it succeeds.

To work around this, the caller then has to use __GFP_RETRY_MAYFAIL
| __GFP_NOFAIL, which is telling the allocation that it is allowed
to fail but it also must not fail. Again, this makes no sense at
all, and on top of that it doesn't give us the fast-fail semantics
we want from the kmalloc side of kvmalloc.
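
The catch-22 can be shown with a small userspace model of the quoted flag adjustment. This is a sketch only: the `DEMO_*` bit values and the `demo_*` helpers are hypothetical and do not match the kernel's real gfp bit layout; only the branch structure is copied from the snippet quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; the real kernel values differ. */
#define DEMO_GFP_NOWARN		(1u << 0)
#define DEMO_GFP_NORETRY	(1u << 1)
#define DEMO_GFP_RETRY_MAYFAIL	(1u << 2)
#define DEMO_GFP_NOFAIL		(1u << 3)
#define DEMO_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */

/* Same shape as the quoted adjustment for the kmalloc() attempt. */
static unsigned int demo_kmalloc_flags(unsigned int flags, size_t size)
{
	if (size > DEMO_PAGE_SIZE) {
		flags |= DEMO_GFP_NOWARN;

		if (!(flags & DEMO_GFP_RETRY_MAYFAIL))
			flags |= DEMO_GFP_NORETRY;
	}
	return flags;
}

/* True when a mask says "must not fail" and "may give up" at once. */
static bool demo_flags_contradict(unsigned int flags)
{
	bool nofail = (flags & DEMO_GFP_NOFAIL) != 0;
	bool may_give_up = (flags & (DEMO_GFP_NORETRY |
				     DEMO_GFP_RETRY_MAYFAIL)) != 0;

	return nofail && may_give_up;
}
```

In this model, a no-fail request larger than one page ends up contradictory whether or not the caller adds the retry-may-fail bit: without it the helper adds no-retry, and with it the caller has supplied the contradiction directly.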

i.e. high order page allocation from kmalloc() is an optimisation,
not a requirement for kvmalloc(). If high order page allocation is
frequently more expensive than simply falling back to vmalloc(),
then we've made the wrong optimisation choices for the kvmalloc()
implementation...

> I think it might be better to do this:
> 
> 		flags |= __GFP_NOWARN;
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags |= __GFP_NORETRY;
> +		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think it's entirely appropriate for a call to kvmalloc() to do
> direct reclaim if it's asking for, say, 16KiB and we don't have any of
> those available.

I disagree - we have background compaction to address the lack of
high order folios in the allocator reserves. Let that do the work of
resolving the internal resource shortage instead of slowing down
allocations that *do not require high order pages to be allocated*.

> Better than exacerbating the fragmentation problem by
> allocating 4x4KiB pages, each from different groupings.

We have no evidence that this allocation behaviour in XFS causes or
exacerbates memory fragmentation. We have been running it in
production systems for a few years now....

-Dave.
Shakeel Butt April 2, 2025, 11:10 p.m. UTC | #14
On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > Hence if there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> > > 
> > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > overhead by around 80% when compared to a standard kvmalloc()
> > > call. Numbers and profiles were documented in the commit message
> > > (reproduced in whole below)...
> > 
> > > Btw. it would be really great to have such concerns posted to the
> > > linux-mm ML so that we are aware of them.
> 
> I have brought it up in the past, along with all the other kvmalloc
> API problems that are mentioned in that commit message.
> Unfortunately, discussion focus always ended up on calling context
> and API flags (e.g. whether stuff like GFP_NOFS should be supported
> or not), not the fast-fail-then-no-fail behaviour we need.
> 
> Yes, these discussions have resulted in API changes that support
> some new subset of gfp flags, but the performance issues have never
> been addressed...
> 
> > kvmalloc currently doesn't support GFP_NOWAIT semantics, but it does allow
> > callers to express "I prefer the SLAB allocator over vmalloc".
> 
> The conditional use of __GFP_NORETRY for the kmalloc call is broken
> if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> mask to hold __GFP_NOFAIL | __GFP_NORETRY....
> 
> We have a hard requirement for xlog_kvmalloc() to provide
> __GFP_NOFAIL semantics.
> 
> IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> performance with fallback to vmalloc(__GFP_NOFAIL) for
> correctness...
> 

Are you asking the above kvmalloc() semantics just for xfs or for all
the users of kvmalloc() api? 

> > I think we could make
> > the default kvmalloc slab path weaker by default as those who really
> > want slab already have means to achieve that. There is a risk of long
> > term fragmentation but I think this is worth trying
> 
> We've been doing this for a few years now in XFS in a hot path that
> can make in the order of a million xlog_kvmalloc() calls a second.
> We've not seen any evidence that this causes or exacerbates memory
> fragmentation....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..c53ba733bda5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -581,7 +581,15 @@  static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	error = -ENOMEM;
 	if (count >= KMALLOC_MAX_SIZE)
 		goto out;
-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
+
+	/*
+	 * Use vmalloc if the count is too large to avoid costly high-order page
+	 * allocations.
+	 */
+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
+	else
+		kbuf = vmalloc(count + 1);
 	if (!kbuf)
 		goto out;
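
The size threshold in the hunk above can be worked through in userspace. This is a sketch under two assumptions: 4 KiB pages, and a costly order of 3 (mirroring PAGE_ALLOC_COSTLY_ORDER), which puts the cut-over at 32 KiB. The `demo_*` names are hypothetical. It shows which allocation order a contiguous request of a given size needs and which branch of the patch a given `count` takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */
#define DEMO_COSTLY_ORDER	3u	/* assumption: mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
static unsigned int demo_order_for(size_t size)
{
	unsigned int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Mirrors the patch's branch: true means the vmalloc() path is taken. */
static bool demo_patch_uses_vmalloc(size_t count)
{
	return count >= ((size_t)DEMO_PAGE_SIZE << DEMO_COSTLY_ORDER);
}
```

Under these assumptions a 16 KiB request is order 2 (the 4x4KiB case mentioned in the thread), 32 KiB is order 3, and a 256 KiB buffer is order 6, so reads of 32 KiB or more would take the vmalloc() branch.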