Message ID | 20250401073046.51121-1-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | proc: Avoid costly high-order page allocations when reading proc files |
On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>While investigating a kcompactd 100% CPU utilization issue in production, I
>observed frequent costly high-order (order-6) page allocations triggered by
>proc file reads from monitoring tools. This can be reproduced with a simple
>test case:
>
>  fd = open(PROC_FILE, O_RDONLY);
>  size = read(fd, buff, 256KB);
>  close(fd);
>
>Although we should modify the monitoring tools to use smaller buffer sizes,
>we should also enhance the kernel to prevent these expensive high-order
>allocations.
>
>Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>Cc: Josef Bacik <josef@toxicpanda.com>
>---
> fs/proc/proc_sysctl.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index cc9d74a06ff0..c53ba733bda5 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> 	error = -ENOMEM;
> 	if (count >= KMALLOC_MAX_SIZE)
> 		goto out;
>-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
>+
>+	/*
>+	 * Use vmalloc if the count is too large to avoid costly high-order page
>+	 * allocations.
>+	 */
>+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>+		kbuf = kvzalloc(count + 1, GFP_KERNEL);

Why not move this check into kvmalloc family?

>+	else
>+		kbuf = vmalloc(count + 1);

You dropped the zeroing. This must be vzalloc.

> 	if (!kbuf)
> 		goto out;
>

Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?

-Kees
On Tue, Apr 1, 2025 at 10:01 PM Kees Cook <kees@kernel.org> wrote:
>
>
>
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > 	error = -ENOMEM;
> > 	if (count >= KMALLOC_MAX_SIZE)
> > 		goto out;
> >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+	/*
> >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> >+	 * allocations.
> >+	 */
> >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
>
> Why not move this check into kvmalloc family?

Good suggestion.

>
> >+	else
> >+		kbuf = vmalloc(count + 1);
>
> You dropped the zeroing. This must be vzalloc.

Nice catch.

>
> > 	if (!kbuf)
> > 		goto out;
> >
>
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?

This would break backward compatibility with existing tools, so we
cannot enforce this restriction.
On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
>
>
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > 	error = -ENOMEM;
> > 	if (count >= KMALLOC_MAX_SIZE)
> > 		goto out;
> >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+	/*
> >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> >+	 * allocations.
> >+	 */
> >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
>
> Why not move this check into kvmalloc family?

Hmm, should this check really be in the kvmalloc family?

I don't think users would expect kvmalloc() to implicitly decide on using
vmalloc() without trying kmalloc() first, just because it's a high-order
allocation.

> >+	else
> >+		kbuf = vmalloc(count + 1);
>
> You dropped the zeroing. This must be vzalloc.
>
> > 	if (!kbuf)
> > 		goto out;
> >
>
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?
>
> -Kees
>
> --
> Kees Cook
On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >
> >
> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > >observed frequent costly high-order (order-6) page allocations triggered by
> > >proc file reads from monitoring tools. This can be reproduced with a simple
> > >test case:
> > >
> > >  fd = open(PROC_FILE, O_RDONLY);
> > >  size = read(fd, buff, 256KB);
> > >  close(fd);
> > >
> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > >we should also enhance the kernel to prevent these expensive high-order
> > >allocations.
> > >
> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > >Cc: Josef Bacik <josef@toxicpanda.com>
> > >---
> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > >index cc9d74a06ff0..c53ba733bda5 100644
> > >--- a/fs/proc/proc_sysctl.c
> > >+++ b/fs/proc/proc_sysctl.c
> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > 	error = -ENOMEM;
> > > 	if (count >= KMALLOC_MAX_SIZE)
> > > 		goto out;
> > >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >+
> > >+	/*
> > >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> > >+	 * allocations.
> > >+	 */
> > >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >
> > Why not move this check into kvmalloc family?
>
> Hmm should this check really be in kvmalloc family?

Modifying the existing kvmalloc functions risks performance regressions.
Could we instead introduce a new variant like vkmalloc() (favoring
vmalloc over kmalloc) or kvmalloc_costless()?

>
> I don't think users would expect kvmalloc() to implicitly decide on using
> vmalloc() without trying kmalloc() first, just because it's a high-order
> allocation.
>
On 4/2/25 10:42, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
>> >
>> >
>> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>> > >While investigating a kcompactd 100% CPU utilization issue in production, I
>> > >observed frequent costly high-order (order-6) page allocations triggered by
>> > >proc file reads from monitoring tools. This can be reproduced with a simple
>> > >test case:
>> > >
>> > >  fd = open(PROC_FILE, O_RDONLY);
>> > >  size = read(fd, buff, 256KB);
>> > >  close(fd);
>> > >
>> > >Although we should modify the monitoring tools to use smaller buffer sizes,
>> > >we should also enhance the kernel to prevent these expensive high-order
>> > >allocations.
>> > >
>> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>> > >Cc: Josef Bacik <josef@toxicpanda.com>
>> > >---
>> > > fs/proc/proc_sysctl.c | 10 +++++++++-
>> > > 1 file changed, 9 insertions(+), 1 deletion(-)
>> > >
>> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>> > >index cc9d74a06ff0..c53ba733bda5 100644
>> > >--- a/fs/proc/proc_sysctl.c
>> > >+++ b/fs/proc/proc_sysctl.c
>> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>> > > 	error = -ENOMEM;
>> > > 	if (count >= KMALLOC_MAX_SIZE)
>> > > 		goto out;
>> > >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> > >+
>> > >+	/*
>> > >+	 * Use vmalloc if the count is too large to avoid costly high-order page
>> > >+	 * allocations.
>> > >+	 */
>> > >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>> > >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> >
>> > Why not move this check into kvmalloc family?
>>
>> Hmm should this check really be in kvmalloc family?
>
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
kmalloc() is before the vmalloc() fallback. It does e.g.:

	if (!(flags & __GFP_RETRY_MAYFAIL))
		flags |= __GFP_NORETRY;

However, if your problem is kcompactd utilization, then the kmalloc() attempt
would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
deferred compaction mechanism would limit the issue in the first place.

The ad-hoc fixing up of a particular place (/proc files reading) or creating
a new vkmalloc() and then spreading its use as you see other places
triggering the issue seems quite suboptimal to me.

>>
>> I don't think users would expect kvmalloc() to implicitly decide on using
>> vmalloc() without trying kmalloc() first, just because it's a high-order
>> allocation.
>>
>
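For readers following the flag discussion, the adjustment that kmalloc_gfp_adjust() applies today, as quoted in this thread, can be modelled in user space. This is only a sketch: the `MODEL_*` bit values and the 4 KiB page size are assumptions made for illustration and are not the kernel's real `__GFP_*` definitions.

```c
#include <stddef.h>

/*
 * Hypothetical bit values, chosen only for this illustration.
 * They are NOT the kernel's real __GFP_* constants.
 */
#define MODEL_GFP_NOWARN        (1u << 0)
#define MODEL_GFP_NORETRY       (1u << 1)
#define MODEL_GFP_RETRY_MAYFAIL (1u << 2)
#define MODEL_GFP_NOFAIL        (1u << 3)
#define MODEL_PAGE_SIZE         4096u	/* assumed 4 KiB pages */

/*
 * User-space model of the current (unpatched) adjustment: requests
 * larger than one page get no failure warning, no retries unless the
 * caller asked for RETRY_MAYFAIL, and lose NOFAIL because the vmalloc
 * fallback provides that guarantee.
 */
unsigned int model_gfp_adjust(unsigned int flags, size_t size)
{
	if (size > MODEL_PAGE_SIZE) {
		flags |= MODEL_GFP_NOWARN;

		if (!(flags & MODEL_GFP_RETRY_MAYFAIL))
			flags |= MODEL_GFP_NORETRY;

		/* nofail semantic is implemented by the fallback */
		flags &= ~MODEL_GFP_NOFAIL;
	}
	return flags;
}
```

Note that nothing in this adjustment touches the reclaim bits, which is why the kmalloc() attempt can still wake kswapd and kcompactd.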
On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > >
> > >
> > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > >test case:
> > > >
> > > >  fd = open(PROC_FILE, O_RDONLY);
> > > >  size = read(fd, buff, 256KB);
> > > >  close(fd);
> > > >
> > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > >we should also enhance the kernel to prevent these expensive high-order
> > > >allocations.
> > > >
> > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > >---
> > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > >--- a/fs/proc/proc_sysctl.c
> > > >+++ b/fs/proc/proc_sysctl.c
> > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > > 	error = -ENOMEM;
> > > > 	if (count >= KMALLOC_MAX_SIZE)
> > > > 		goto out;
> > > >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >+
> > > >+	/*
> > > >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> > > >+	 * allocations.
> > > >+	 */
> > > >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >
> > > Why not move this check into kvmalloc family?
> >
> > Hmm should this check really be in kvmalloc family?
>
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We should fix kvmalloc() instead of continuing to force
subsystems to work around the limitations of kvmalloc().

Have a look at xlog_kvmalloc() in XFS. It implements a basic
fast-fail, no-retry high-order kmalloc before it falls back to
vmalloc by turning off direct reclaim for the kmalloc() call.
Hence if there isn't a high-order page on the free lists ready
to allocate, it falls back to vmalloc() immediately.

For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
overhead by around 80% when compared to a standard kvmalloc()
call. Numbers and profiles were documented in the commit message
(reproduced in whole below)...

> > I don't think users would expect kvmalloc() to implicitly decide on using
> > vmalloc() without trying kmalloc() first, just because it's a high-order
> > allocation.

Right, but users expect kvmalloc() to use the most efficient
allocation paths available to it. In this case, vmalloc() is faster
and more reliable than direct reclaim w/ compaction. Hence vmalloc()
should really be the primary fallback path when high-order pages are
not immediately available to kmalloc() when called from kvmalloc()...

-Dave.
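The fast-fail-then-fallback control flow described above can be sketched in user space. This is a loose model and not the XFS code: `model_fast_alloc()` and `model_fallback_alloc()` are hypothetical stand-ins for a no-direct-reclaim kmalloc() and for vmalloc(), and the size cutoff is an arbitrary way to make the "fast" path fail deterministically.

```c
#include <stddef.h>
#include <stdlib.h>

/* Arbitrary cutoff used only to make the "fast" path fail on demand. */
#define MODEL_FAST_LIMIT 4096u

enum alloc_path { PATH_NONE, PATH_FAST, PATH_FALLBACK };

/*
 * Hypothetical stand-in for a kmalloc() attempt with direct reclaim
 * turned off: it either succeeds at once or fails at once, with no
 * retry loop. "No contiguous block ready" is modelled as a size check.
 */
void *model_fast_alloc(size_t size)
{
	if (size > MODEL_FAST_LIMIT)
		return NULL;
	return malloc(size);
}

/* Hypothetical stand-in for vmalloc(); here it is simply malloc(). */
void *model_fallback_alloc(size_t size)
{
	return malloc(size);
}

/* Try the cheap attempt once, then go straight to the fallback. */
void *model_kvmalloc(size_t size, enum alloc_path *path)
{
	void *p = model_fast_alloc(size);

	if (p) {
		*path = PATH_FAST;
		return p;
	}

	p = model_fallback_alloc(size);
	*path = p ? PATH_FALLBACK : PATH_NONE;
	return p;
}
```

The point of the pattern is that the first attempt never does expensive work on failure, so the cost of a miss is one failed lookup plus the fallback allocation.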
On Wed 02-04-25 11:25:12, Vlastimil Babka wrote:
> On 4/2/25 10:42, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >> >
> >> >
> >> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> >> > >observed frequent costly high-order (order-6) page allocations triggered by
> >> > >proc file reads from monitoring tools. This can be reproduced with a simple
> >> > >test case:
> >> > >
> >> > >  fd = open(PROC_FILE, O_RDONLY);
> >> > >  size = read(fd, buff, 256KB);
> >> > >  close(fd);
> >> > >
> >> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> >> > >we should also enhance the kernel to prevent these expensive high-order
> >> > >allocations.
> >> > >
> >> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> > >Cc: Josef Bacik <josef@toxicpanda.com>
> >> > >---
> >> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> >> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> > >index cc9d74a06ff0..c53ba733bda5 100644
> >> > >--- a/fs/proc/proc_sysctl.c
> >> > >+++ b/fs/proc/proc_sysctl.c
> >> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >> > > 	error = -ENOMEM;
> >> > > 	if (count >= KMALLOC_MAX_SIZE)
> >> > > 		goto out;
> >> > >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> > >+
> >> > >+	/*
> >> > >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> >> > >+	 * allocations.
> >> > >+	 */
> >> > >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >> > >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> >
> >> > Why not move this check into kvmalloc family?
> >>
> >> Hmm should this check really be in kvmalloc family?
> >
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
>
> We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
> kmalloc() is before the vmalloc() fallback. It does e.g.:
>
> 	if (!(flags & __GFP_RETRY_MAYFAIL))
> 		flags |= __GFP_NORETRY;
>
> However if your problem is kcompactd utilization then the kmalloc() attempt
> would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
> kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
> deferred compaction mechanism would limit the issue in the first place.

Yes, triggering heavy compaction for costly allocations seems to be quite
bad. We have __GFP_RETRY_MAYFAIL for that purpose if the caller really
needs the allocation to try really hard.

> The ad-hoc fixing up of a particular place (/proc files reading) or creating
> a new vkmalloc() and then spreading its use as you see other places
> triggering the issue seems quite suboptimal to me.

Yes, I absolutely agree.
On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > > >test case:
> > > > >
> > > > >  fd = open(PROC_FILE, O_RDONLY);
> > > > >  size = read(fd, buff, 256KB);
> > > > >  close(fd);
> > > > >
> > > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > > >we should also enhance the kernel to prevent these expensive high-order
> > > > >allocations.
> > > > >
> > > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > > >---
> > > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > > >--- a/fs/proc/proc_sysctl.c
> > > > >+++ b/fs/proc/proc_sysctl.c
> > > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > > > 	error = -ENOMEM;
> > > > > 	if (count >= KMALLOC_MAX_SIZE)
> > > > > 		goto out;
> > > > >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >+
> > > > >+	/*
> > > > >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> > > > >+	 * allocations.
> > > > >+	 */
> > > > >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >
> > > > Why not move this check into kvmalloc family?
> > >
> > > Hmm should this check really be in kvmalloc family?
> >
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
>
> We should fix kvmalloc() instead of continuing to force
> subsystems to work around the limitations of kvmalloc().

Agreed!

> Have a look at xlog_kvmalloc() in XFS. It implements a basic
> fast-fail, no retry high order kmalloc before it falls back to
> vmalloc by turning off direct reclaim for the kmalloc() call.
> Hence if there isn't a high-order page on the free lists ready
> to allocate, it falls back to vmalloc() immediately.
>
> For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> overhead by around 80% when compared to a standard kvmalloc()
> call. Numbers and profiles were documented in the commit message
> (reproduced in whole below)...

Btw. it would be really great to have such concerns posted to the
linux-mm ML so that we are aware of them.

kvmalloc currently doesn't support GFP_NOWAIT semantics, but it does allow
the caller to express "I prefer the SLAB allocator over vmalloc". I think
we could make the default kvmalloc slab path weaker, as those who really
want slab already have the means to achieve that. There is a risk of
long-term fragmentation, but I think this is worth trying:

diff --git a/mm/util.c b/mm/util.c
index 60aa40f612b8..8386f6976d7d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
 	 * contribute to a long term fragmentation less than vmalloc fallback.
-	 * However make sure that larger requests are not too disruptive - no
-	 * OOM killer and no allocation failure warnings as we have a fallback.
+	 * However make sure that larger requests are not too disruptive - i.e.
+	 * do not direct reclaim unless physically continuous memory is preferred
+	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+	 * working in the background but the allocation itself.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;

 		if (!(flags & __GFP_RETRY_MAYFAIL))
 			flags |= __GFP_NORETRY;
+		else
+			flags &= ~__GFP_DIRECT_RECLAIM;

 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..c53ba733bda5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	error = -ENOMEM;
 	if (count >= KMALLOC_MAX_SIZE)
 		goto out;
-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
+
+	/*
+	 * Use vmalloc if the count is too large to avoid costly high-order page
+	 * allocations.
+	 */
+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
+	else
+		kbuf = vmalloc(count + 1);
 	if (!kbuf)
 		goto out;
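To make the threshold in the patch concrete, the sketch below models the `count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)` test in user space. It assumes 4 KiB pages, which is architecture and configuration dependent, and uses 3 for PAGE_ALLOC_COSTLY_ORDER; the `model_*` names are hypothetical helpers and not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE               4096u	/* assumption: 4 KiB pages */
#define MODEL_PAGE_ALLOC_COSTLY_ORDER 3u	/* orders above this are "costly" */

/* Smallest order such that (page size << order) covers size bytes. */
unsigned int model_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = MODEL_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/*
 * The condition from the patch: counts below the threshold keep using
 * kvzalloc(), everything else goes straight to the vmalloc path.
 */
bool model_patch_uses_vmalloc(size_t count)
{
	return count >= ((size_t)MODEL_PAGE_SIZE << MODEL_PAGE_ALLOC_COSTLY_ORDER);
}
```

Under these assumptions the threshold is 32768 bytes: a count of 32767 leads to a 32768-byte request, which is order 3 and therefore still not costly, while any larger count bypasses the page allocator's high-order path.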
While investigating a kcompactd 100% CPU utilization issue in production, I
observed frequent costly high-order (order-6) page allocations triggered by
proc file reads from monitoring tools. This can be reproduced with a simple
test case:

  fd = open(PROC_FILE, O_RDONLY);
  size = read(fd, buff, 256KB);
  close(fd);

Although we should modify the monitoring tools to use smaller buffer sizes,
we should also enhance the kernel to prevent these expensive high-order
allocations.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
---
 fs/proc/proc_sysctl.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
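The three-line test case in the commit message is pseudocode. A compilable user-space version might look like the following; `read_once()` is a hypothetical helper name, and the buffer size is passed in by the caller rather than fixed at 256 KiB.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path, issue a single read() with a user buffer of bufsize
 * bytes, and return the number of bytes read, or -1 on any failure.
 * The size of the read request, not the size of the file, is what
 * drives the kernel-side buffer allocation discussed in this thread.
 */
ssize_t read_once(const char *path, size_t bufsize)
{
	char *buf;
	ssize_t n;
	int fd;

	buf = malloc(bufsize);
	if (!buf)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	n = read(fd, buf, bufsize);

	close(fd);
	free(buf);
	return n;
}
```

Calling `read_once(path, 256 * 1024)` in a loop against a file under /proc/sys (the commit message does not name the file, so the choice of target is left to the reader) should correspond to the access pattern described for the monitoring tools.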