[v2] mm: alloc_pages_bulk: remove assumption of populating only NULL elements

Message ID 20250228094424.757465-1-linyunsheng@huawei.com (mailing list archive)
State New
Series [v2] mm: alloc_pages_bulk: remove assumption of populating only NULL elements

Commit Message

Yunsheng Lin Feb. 28, 2025, 9:44 a.m. UTC
As mentioned in [1], it seems odd to check for NULL elements in
the middle of bulk page allocation, and it seems the caller can
do a better job of bulk allocating pages into a whole array
sequentially, without checking for NULL elements first, for most
of the existing users.

From analyzing the bulk allocation API usage in fs, it seems
that the callers in fs/btrfs/extent_io.c and net/sunrpc/svc_xprt.c
depend on the assumption of populating only NULL elements,
while erofs and btrfs don't, see:
commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
commit d6db47e571dc ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
commit c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for buffers")
commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")

Change SUNRPC and btrfs to not depend on the assumption.
Other existing callers seem to be passing all-NULL elements
via memset, kzalloc, etc.

Remove the assumption of populating only NULL elements and treat
page_array as an output parameter like kmem_cache_alloc_bulk().
Removing the above assumption also enables the caller to skip
zeroing the array before calling the page bulk allocation API,
which gives about a 1~2 ns performance improvement for the
time_bench_page_pool03_slow() test case for page_pool on an
x86 VM system; this reduces some of the performance impact of
fixing the DMA API misuse problem in [2], with performance
improving from 87.886 ns to 86.429 ns.

1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
2. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
CC: Jesper Dangaard Brouer <hawk@kernel.org>
CC: Luiz Capitulino <luizcap@redhat.com>
CC: Mel Gorman <mgorman@techsingularity.net>
CC: Dave Chinner <david@fromorbit.com>
CC: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
---
V2:
1. Drop RFC tag and rebase on latest linux-next.
2. Fix a compile error for xfs.
3. Defragment the page_array for SUNRPC and btrfs.
---
 drivers/vfio/pci/virtio/migrate.c |  2 --
 fs/btrfs/extent_io.c              | 23 +++++++++++++++++-----
 fs/erofs/zutil.c                  | 12 ++++++------
 fs/xfs/xfs_buf.c                  |  9 +++++----
 mm/page_alloc.c                   | 32 +++++--------------------------
 net/core/page_pool.c              |  3 ---
 net/sunrpc/svc_xprt.c             | 22 +++++++++++++++++----
 7 files changed, 52 insertions(+), 51 deletions(-)

Comments

Chuck Lever March 3, 2025, 10:13 p.m. UTC | #1
On 2/28/25 4:44 AM, Yunsheng Lin wrote:
> As mentioned in [1], it seems odd to check NULL elements in
> the middle of page bulk allocating, and it seems caller can
> do a better job of bulk allocating pages into a whole array
> sequentially without checking NULL elements first before
> doing the page bulk allocation for most of existing users.

Sorry, but this still makes a claim without providing any data
to back it up. Why can callers "do a better job"?


> Through analyzing of bulk allocation API used in fs, it
> seems that the callers are depending on the assumption of
> populating only NULL elements in fs/btrfs/extent_io.c and
> net/sunrpc/svc_xprt.c while erofs and btrfs don't, see:
> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
> commit d6db47e571dc ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
> commit c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for buffers")
> commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> 
> Change SUNRPC and btrfs to not depend on the assumption.
> Other existing callers seems to be passing all NULL elements
> via memset, kzalloc, etc.
> 
> Remove assumption of populating only NULL elements and treat
> page_array as output parameter like kmem_cache_alloc_bulk().
> Remove the above assumption also enable the caller to not
> zero the array before calling the page bulk allocating API,
> which has about 1~2 ns performance improvement for the test
> case of time_bench_page_pool03_slow() for page_pool in a
> x86 vm system, this reduces some performance impact of
> fixing the DMA API misuse problem in [2], performance
> improves from 87.886 ns to 86.429 ns.
> 
> 1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
> 2. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
> CC: Jesper Dangaard Brouer <hawk@kernel.org>
> CC: Luiz Capitulino <luizcap@redhat.com>
> CC: Mel Gorman <mgorman@techsingularity.net>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Acked-by: Jeff Layton <jlayton@kernel.org>
> ---
> V2:
> 1. Drop RFC tag and rebased on latest linux-next.
> 2. Fix a compile error for xfs.
> 3. Defragmemt the page_array for SUNRPC and btrfs.
> ---
>  drivers/vfio/pci/virtio/migrate.c |  2 --
>  fs/btrfs/extent_io.c              | 23 +++++++++++++++++-----
>  fs/erofs/zutil.c                  | 12 ++++++------
>  fs/xfs/xfs_buf.c                  |  9 +++++----
>  mm/page_alloc.c                   | 32 +++++--------------------------
>  net/core/page_pool.c              |  3 ---
>  net/sunrpc/svc_xprt.c             | 22 +++++++++++++++++----
>  7 files changed, 52 insertions(+), 51 deletions(-)

52:51 is not an improvement. 1-2 ns is barely worth mentioning. The
sunrpc and btrfs callers are more complex and carry duplicated code.

Not an outright objection from me, but it's hard to justify this change.


> diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
> index ba92bb4e9af9..9f003a237dec 100644
> --- a/drivers/vfio/pci/virtio/migrate.c
> +++ b/drivers/vfio/pci/virtio/migrate.c
> @@ -91,8 +91,6 @@ static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
>  		if (ret)
>  			goto err_append;
>  		buf->allocated_length += filled * PAGE_SIZE;
> -		/* clean input for another bulk allocation */
> -		memset(page_list, 0, filled * sizeof(*page_list));
>  		to_fill = min_t(unsigned int, to_alloc,
>  				PAGE_SIZE / sizeof(*page_list));
>  	} while (to_alloc > 0);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index f0a1da40d641..ef52cedd9873 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -623,13 +623,26 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
>  			   bool nofail)
>  {
>  	const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
> -	unsigned int allocated;
> +	unsigned int allocated, ret;
>  
> -	for (allocated = 0; allocated < nr_pages;) {
> -		unsigned int last = allocated;
> +	/* Defragment page_array so pages can be bulk allocated into remaining
> +	 * NULL elements sequentially.
> +	 */
> +	for (allocated = 0, ret = 0; ret < nr_pages; ret++) {
> +		if (page_array[ret]) {
> +			page_array[allocated] = page_array[ret];
> +			if (ret != allocated)
> +				page_array[ret] = NULL;
> +
> +			allocated++;
> +		}
> +	}
>  
> -		allocated = alloc_pages_bulk(gfp, nr_pages, page_array);
> -		if (unlikely(allocated == last)) {
> +	while (allocated < nr_pages) {
> +		ret = alloc_pages_bulk(gfp, nr_pages - allocated,
> +				       page_array + allocated);
> +		allocated += ret;
> +		if (unlikely(!ret)) {
>  			/* No progress, fail and do cleanup. */
>  			for (int i = 0; i < allocated; i++) {
>  				__free_page(page_array[i]);
> diff --git a/fs/erofs/zutil.c b/fs/erofs/zutil.c
> index 55ff2ab5128e..1c50b5e27371 100644
> --- a/fs/erofs/zutil.c
> +++ b/fs/erofs/zutil.c
> @@ -85,13 +85,13 @@ int z_erofs_gbuf_growsize(unsigned int nrpages)
>  
>  		for (j = 0; j < gbuf->nrpages; ++j)
>  			tmp_pages[j] = gbuf->pages[j];
> -		do {
> -			last = j;
> -			j = alloc_pages_bulk(GFP_KERNEL, nrpages,
> -					     tmp_pages);
> -			if (last == j)
> +
> +		for (last = j; last < nrpages; last += j) {
> +			j = alloc_pages_bulk(GFP_KERNEL, nrpages - last,
> +					     tmp_pages + last);
> +			if (!j)
>  				goto out;
> -		} while (j != nrpages);
> +		}
>  
>  		ptr = vmap(tmp_pages, nrpages, VM_MAP, PAGE_KERNEL);
>  		if (!ptr)
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 5d560e9073f4..b4e95b2dd0f0 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -319,16 +319,17 @@ xfs_buf_alloc_pages(
>  	 * least one extra page.
>  	 */
>  	for (;;) {
> -		long	last = filled;
> +		long	alloc;
>  
> -		filled = alloc_pages_bulk(gfp_mask, bp->b_page_count,
> -					  bp->b_pages);
> +		alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - filled,
> +					 bp->b_pages + filled);
> +		filled += alloc;
>  		if (filled == bp->b_page_count) {
>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>  			break;
>  		}
>  
> -		if (filled != last)
> +		if (alloc)
>  			continue;
>  
>  		if (flags & XBF_READ_AHEAD) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f07c95eb5ac1..625d14ee4a41 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4599,9 +4599,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>   * This is a batched version of the page allocator that attempts to
>   * allocate nr_pages quickly. Pages are added to the page_array.
>   *
> - * Note that only NULL elements are populated with pages and nr_pages
> - * is the maximum number of pages that will be stored in the array.
> - *
>   * Returns the number of pages in the array.
>   */
>  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> @@ -4617,29 +4614,18 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	struct alloc_context ac;
>  	gfp_t alloc_gfp;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> -	int nr_populated = 0, nr_account = 0;
> -
> -	/*
> -	 * Skip populated array elements to determine if any pages need
> -	 * to be allocated before disabling IRQs.
> -	 */
> -	while (nr_populated < nr_pages && page_array[nr_populated])
> -		nr_populated++;
> +	int nr_populated = 0;
>  
>  	/* No pages requested? */
>  	if (unlikely(nr_pages <= 0))
>  		goto out;
>  
> -	/* Already populated array? */
> -	if (unlikely(nr_pages - nr_populated == 0))
> -		goto out;
> -
>  	/* Bulk allocator does not support memcg accounting. */
>  	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT))
>  		goto failed;
>  
>  	/* Use the single page allocator for one page. */
> -	if (nr_pages - nr_populated == 1)
> +	if (nr_pages == 1)
>  		goto failed;
>  
>  #ifdef CONFIG_PAGE_OWNER
> @@ -4711,24 +4697,16 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	/* Attempt the batch allocation */
>  	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
>  	while (nr_populated < nr_pages) {
> -
> -		/* Skip existing pages */
> -		if (page_array[nr_populated]) {
> -			nr_populated++;
> -			continue;
> -		}
> -
>  		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
>  								pcp, pcp_list);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
> -			if (!nr_account) {
> +			if (!nr_populated) {
>  				pcp_spin_unlock(pcp);
>  				goto failed_irq;
>  			}
>  			break;
>  		}
> -		nr_account++;
>  
>  		prep_new_page(page, 0, gfp, 0);
>  		set_page_refcounted(page);
> @@ -4738,8 +4716,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	pcp_spin_unlock(pcp);
>  	pcp_trylock_finish(UP_flags);
>  
> -	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
> -	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
> +	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_populated);
> +	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_populated);
>  
>  out:
>  	return nr_populated;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index acef1fcd8ddc..200b99375cb6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -544,9 +544,6 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
>  	if (unlikely(pool->alloc.count > 0))
>  		return pool->alloc.cache[--pool->alloc.count];
>  
> -	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk */
> -	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
> -
>  	nr_pages = alloc_pages_bulk_node(gfp, pool->p.nid, bulk,
>  					 (struct page **)pool->alloc.cache);
>  	if (unlikely(!nr_pages))
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ae25405d8bd2..80fbc4ffef6d 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -663,9 +663,23 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>  		pages = RPCSVC_MAXPAGES;
>  	}
>  
> -	for (filled = 0; filled < pages; filled = ret) {
> -		ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
> -		if (ret > filled)
> +	/* Defragment the rqstp->rq_pages so pages can be bulk allocated into
> +	 * remaining NULL elements sequentially.
> +	 */
> +	for (filled = 0, ret = 0; ret < pages; ret++) {
> +		if (rqstp->rq_pages[ret]) {
> +			rqstp->rq_pages[filled] = rqstp->rq_pages[ret];
> +			if (ret != filled)
> +				rqstp->rq_pages[ret] = NULL;
> +
> +			filled++;
> +		}
> +	}
> +
> +	for (; filled < pages; filled += ret) {
> +		ret = alloc_pages_bulk(GFP_KERNEL, pages - filled,
> +				       rqstp->rq_pages + filled);
> +		if (ret)
>  			/* Made progress, don't sleep yet */
>  			continue;
>  
> @@ -674,7 +688,7 @@ static bool svc_alloc_arg(struct svc_rqst *rqstp)
>  			set_current_state(TASK_RUNNING);
>  			return false;
>  		}
> -		trace_svc_alloc_arg_err(pages, ret);
> +		trace_svc_alloc_arg_err(pages, filled);
>  		memalloc_retry_wait(GFP_KERNEL);
>  	}
>  	rqstp->rq_page_end = &rqstp->rq_pages[pages];
Dave Chinner March 4, 2025, 8:18 a.m. UTC | #2
On Fri, Feb 28, 2025 at 05:44:20PM +0800, Yunsheng Lin wrote:
> As mentioned in [1], it seems odd to check NULL elements in
> the middle of page bulk allocating, and it seems caller can
> do a better job of bulk allocating pages into a whole array
> sequentially without checking NULL elements first before
> doing the page bulk allocation for most of existing users.
> 
> Through analyzing of bulk allocation API used in fs, it
> seems that the callers are depending on the assumption of
> populating only NULL elements in fs/btrfs/extent_io.c and
> net/sunrpc/svc_xprt.c while erofs and btrfs don't, see:
> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
> commit d6db47e571dc ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
> commit c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for buffers")
> commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> 
> Change SUNRPC and btrfs to not depend on the assumption.
> Other existing callers seems to be passing all NULL elements
> via memset, kzalloc, etc.
> 
> Remove assumption of populating only NULL elements and treat
> page_array as output parameter like kmem_cache_alloc_bulk().
> Remove the above assumption also enable the caller to not
> zero the array before calling the page bulk allocating API,
> which has about 1~2 ns performance improvement for the test
> case of time_bench_page_pool03_slow() for page_pool in a
> x86 vm system, this reduces some performance impact of
> fixing the DMA API misuse problem in [2], performance
> improves from 87.886 ns to 86.429 ns.

How much slower did you make btrfs and sunrpc by adding all the
defragmenting code there?

> 
> 1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
> 2. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
> CC: Jesper Dangaard Brouer <hawk@kernel.org>
> CC: Luiz Capitulino <luizcap@redhat.com>
> CC: Mel Gorman <mgorman@techsingularity.net>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Acked-by: Jeff Layton <jlayton@kernel.org>
> ---
> V2:
> 1. Drop RFC tag and rebased on latest linux-next.
> 2. Fix a compile error for xfs.

And you still haven't tested the code changes to XFS, because
this patch is also broken.

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 5d560e9073f4..b4e95b2dd0f0 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -319,16 +319,17 @@ xfs_buf_alloc_pages(
>  	 * least one extra page.
>  	 */
>  	for (;;) {
> -		long	last = filled;
> +		long	alloc;
>  
> -		filled = alloc_pages_bulk(gfp_mask, bp->b_page_count,
> -					  bp->b_pages);
> +		alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - filled,
> +					 bp->b_pages + filled);
> +		filled += alloc;
>  		if (filled == bp->b_page_count) {
>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>  			break;
>  		}
>  
> -		if (filled != last)
> +		if (alloc)
>  			continue;

alloc_pages_bulk() now returns the number of pages allocated in the
array. So if we ask for 4 pages, then get 2, filled is now 2. Then
we loop, ask for another 2 pages, get those two pages and it returns
4. Now filled is 6, and we continue.

Now we ask alloc_pages_bulk() for -2 pages, which returns 4 pages...

Worse behaviour: second time around, no page allocation succeeds
so it returns 2 pages. Filled is now 4, which is the number of pages
we need, so we break out of the loop with only 2 pages allocated.
There's about to be kernel crashes occur.....

Once is a mistake, twice is completely unacceptable.  When XFS stops
using alloc_pages_bulk (probably 6.15) I won't care anymore. But
until then, please stop trying to change this code.

NACK.

-Dave.
Qu Wenruo March 4, 2025, 9:17 a.m. UTC | #3
On 2025/2/28 20:14, Yunsheng Lin wrote:
> As mentioned in [1], it seems odd to check NULL elements in
> the middle of page bulk allocating, and it seems caller can
> do a better job of bulk allocating pages into a whole array
> sequentially without checking NULL elements first before
> doing the page bulk allocation for most of existing users.
> 
> Through analyzing of bulk allocation API used in fs, it
> seems that the callers are depending on the assumption of
> populating only NULL elements in fs/btrfs/extent_io.c and
> net/sunrpc/svc_xprt.c while erofs and btrfs don't, see:
> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")

If you want to change the btrfs part, please run full fstests with 
SCRATCH_DEV_POOL populated at least.

[...]
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index f0a1da40d641..ef52cedd9873 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -623,13 +623,26 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
>   			   bool nofail)
>   {
>   	const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
> -	unsigned int allocated;
> +	unsigned int allocated, ret;
>   
> -	for (allocated = 0; allocated < nr_pages;) {
> -		unsigned int last = allocated;
> +	/* Defragment page_array so pages can be bulk allocated into remaining
> +	 * NULL elements sequentially.
> +	 */
> +	for (allocated = 0, ret = 0; ret < nr_pages; ret++) {
> +		if (page_array[ret]) {

You just proved how bad the design is.

All the callers have their page array members initialized to NULL, or do
not care and just want alloc_pages_bulk() to overwrite the uninitialized
values.

The best example here is btrfs_encoded_read_regular().
Now your code will just crash encoded read.

Read the context before doing stupid things.

I find it unacceptable that you just change the code without any testing,
or even checking all the involved callers.

> +			page_array[allocated] = page_array[ret];
> +			if (ret != allocated)
> +				page_array[ret] = NULL;
> +
> +			allocated++;
> +		}
> +	}
>   
> -		allocated = alloc_pages_bulk(gfp, nr_pages, page_array);
> -		if (unlikely(allocated == last)) {
> +	while (allocated < nr_pages) {
> +		ret = alloc_pages_bulk(gfp, nr_pages - allocated,
> +				       page_array + allocated);

I see the new interface as way worse than the existing one.

All btrfs usage only wants a simple retry-until-all-fulfilled behavior.

NACK for the btrfs part, and I find it very irresponsible to not even
bother running any test suite before submitting such a mess.

Just stop this, no one will ever take you seriously anymore.

Thanks,
Qu
Yunsheng Lin March 4, 2025, 12:04 p.m. UTC | #4
On 2025/3/4 6:13, Chuck Lever wrote:
> On 2/28/25 4:44 AM, Yunsheng Lin wrote:
>> As mentioned in [1], it seems odd to check NULL elements in
>> the middle of page bulk allocating, and it seems caller can
>> do a better job of bulk allocating pages into a whole array
>> sequentially without checking NULL elements first before
>> doing the page bulk allocation for most of existing users.
> 
> Sorry, but this still makes a claim without providing any data
> to back it up. Why can callers "do a better job"?

What I meant by "do a better job" is that callers are already keeping
track of how many pages have been allocated, and it seems convenient
to just pass 'page_array + nr_allocated' and 'nr_pages - nr_allocated'
on subsequent calls to the page bulk alloc API so that NULL checking
can be avoided, which seems to be the pattern I see in
alloc_pages_bulk_interleave().
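
A minimal sketch of that caller pattern, assuming the changed semantics
where the array is treated purely as output (fill_page_array() is just an
illustrative name, not an existing kernel function):

static unsigned long fill_page_array(gfp_t gfp, struct page **pages,
				     unsigned long nr_pages)
{
	unsigned long nr_allocated = 0;

	while (nr_allocated < nr_pages) {
		unsigned long ret;

		/* pass only the remaining slots, no NULL checking needed */
		ret = alloc_pages_bulk(gfp, nr_pages - nr_allocated,
				       pages + nr_allocated);
		if (!ret)
			break;	/* no progress, let the caller decide what to do */
		nr_allocated += ret;
	}

	return nr_allocated;
}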

> 
> 
>> Through analyzing of bulk allocation API used in fs, it
>> seems that the callers are depending on the assumption of
>> populating only NULL elements in fs/btrfs/extent_io.c and
>> net/sunrpc/svc_xprt.c while erofs and btrfs don't, see:
>> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
>> commit d6db47e571dc ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
>> commit c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for buffers")
>> commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>>
>> Change SUNRPC and btrfs to not depend on the assumption.
>> Other existing callers seems to be passing all NULL elements
>> via memset, kzalloc, etc.
>>
>> Remove assumption of populating only NULL elements and treat
>> page_array as output parameter like kmem_cache_alloc_bulk().
>> Remove the above assumption also enable the caller to not
>> zero the array before calling the page bulk allocating API,
>> which has about 1~2 ns performance improvement for the test
>> case of time_bench_page_pool03_slow() for page_pool in a
>> x86 vm system, this reduces some performance impact of
>> fixing the DMA API misuse problem in [2], performance
>> improves from 87.886 ns to 86.429 ns.
>>
>> 1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
>> 2. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
>> CC: Jesper Dangaard Brouer <hawk@kernel.org>
>> CC: Luiz Capitulino <luizcap@redhat.com>
>> CC: Mel Gorman <mgorman@techsingularity.net>
>> CC: Dave Chinner <david@fromorbit.com>
>> CC: Chuck Lever <chuck.lever@oracle.com>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Acked-by: Jeff Layton <jlayton@kernel.org>
>> ---
>> V2:
>> 1. Drop RFC tag and rebased on latest linux-next.
>> 2. Fix a compile error for xfs.
>> 3. Defragmemt the page_array for SUNRPC and btrfs.
>> ---
>>  drivers/vfio/pci/virtio/migrate.c |  2 --
>>  fs/btrfs/extent_io.c              | 23 +++++++++++++++++-----
>>  fs/erofs/zutil.c                  | 12 ++++++------
>>  fs/xfs/xfs_buf.c                  |  9 +++++----
>>  mm/page_alloc.c                   | 32 +++++--------------------------
>>  net/core/page_pool.c              |  3 ---
>>  net/sunrpc/svc_xprt.c             | 22 +++++++++++++++++----
>>  7 files changed, 52 insertions(+), 51 deletions(-)
> 
> 52:51 is not an improvement. 1-2 ns is barely worth mentioning. The
> sunrpc and btrfs callers are more complex and carry duplicated code.

Yes, the hard part is finding a common file to place a common function
like the one below:

void defragment_pointer_array(void **array, int size)
{
	int slow = 0;

	for (int fast = 0; fast < size; fast++) {
		if (array[fast]) {
			/* move non-NULL entries towards the front, keeping order */
			swap(array[fast], array[slow]);
			slow++;
		}
	}
}

Or introduce a function like alloc_pages_refill_array() for the
sunrpc and xfs use case, and do the array defragmenting in
alloc_pages_refill_array() before calling alloc_pages_bulk()?
Any suggestions?
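
A rough sketch of what such a helper could look like, assuming the
defragment helper above and the changed alloc_pages_bulk() semantics
(alloc_pages_refill_array() is not an existing API, just an illustration):

unsigned long alloc_pages_refill_array(gfp_t gfp, unsigned long nr_pages,
				       struct page **page_array)
{
	unsigned long filled = 0;

	/* compact the already allocated pages to the front of the array */
	defragment_pointer_array((void **)page_array, nr_pages);

	/* count the populated leading entries */
	while (filled < nr_pages && page_array[filled])
		filled++;

	/* bulk allocate into the remaining tail of the array */
	while (filled < nr_pages) {
		unsigned long ret;

		ret = alloc_pages_bulk(gfp, nr_pages - filled,
				       page_array + filled);
		if (!ret)
			break;
		filled += ret;
	}

	return filled;
}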

> 
> Not an outright objection from me, but it's hard to justify this change.

The above should reduce the number to something like 40:51.

Also, I looked more closely at the other callers of alloc_pages_bulk():
it seems vm_module_tags_populate() doesn't clear next_page[i] to NULL after
__free_page() when doing 'Clean up and error out'. I am not sure whether
vm_module_tags_populate() can be called multiple times, as vm_module_tags->pages
seems to be set to all NULL only once, see alloc_tag_init() -> alloc_mod_tags_mem().

Cc Suren to see if there is some clarification on the above.
Yunsheng Lin March 4, 2025, 12:09 p.m. UTC | #5
On 2025/3/4 16:18, Dave Chinner wrote:

...

> 
>>
>> 1. https://lore.kernel.org/all/bd8c2f5c-464d-44ab-b607-390a87ea4cd5@huawei.com/
>> 2. https://lore.kernel.org/all/20250212092552.1779679-1-linyunsheng@huawei.com/
>> CC: Jesper Dangaard Brouer <hawk@kernel.org>
>> CC: Luiz Capitulino <luizcap@redhat.com>
>> CC: Mel Gorman <mgorman@techsingularity.net>
>> CC: Dave Chinner <david@fromorbit.com>
>> CC: Chuck Lever <chuck.lever@oracle.com>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Acked-by: Jeff Layton <jlayton@kernel.org>
>> ---
>> V2:
>> 1. Drop RFC tag and rebased on latest linux-next.
>> 2. Fix a compile error for xfs.
> 
> And you still haven't tested the code changes to XFS, because
> this patch is also broken.

I tested XFS using the commands and test case below, and testing seems
to work fine, or am I missing something obvious here, as I am not really
familiar with the fs subsystem yet:

Steps to set up xfs:
dd if=/dev/zero of=xfs_image bs=1M count=1024
losetup -f xfs_image
losetup -a
./mkfs.xfs /dev/loop0
mkdir xfs_test
mount /dev/loop0 xfs_test/

Test shell file:
#!/bin/bash

# Configuration parameters
DIR="/home/xfs_test"              # Directory to perform file operations
FILE_COUNT=100              # Maximum number of files to create in each loop
MAX_FILE_SIZE=1024          # Maximum file size in KB
MIN_FILE_SIZE=10            # Minimum file size in KB
OPERATIONS=10               # Number of create/delete operations per loop
TOTAL_RUNS=10000               # Total number of loops to run

# Check if the directory exists
if [ ! -d "$DIR" ]; then
    echo "Directory $DIR does not exist. Please create the directory first!"
    exit 1
fi

echo "Starting file system test on: $DIR"

for ((run=1; run<=TOTAL_RUNS; run++)); do
    echo "Run $run of $TOTAL_RUNS"

    # Randomly create files
    for ((i=1; i<=OPERATIONS; i++)); do
        # Generate a random file size between MIN_FILE_SIZE and MAX_FILE_SIZE (in KB)
        FILE_SIZE=$((RANDOM % (MAX_FILE_SIZE - MIN_FILE_SIZE + 1) + MIN_FILE_SIZE))
        # Generate a unique file name using timestamp and random number
        FILE_NAME="$DIR/file_$(date +%s)_$RANDOM"
        # Create a file with random content
        dd if=/dev/urandom of="$FILE_NAME" bs=1K count=$FILE_SIZE &>/dev/null
        echo "Created file: $FILE_NAME, Size: $FILE_SIZE KB"
    done

    # Randomly delete files
    for ((i=1; i<=OPERATIONS; i++)); do
        # List all files in the directory
        FILE_LIST=($(ls $DIR))
        # Check if there are any files to delete
        if [ ${#FILE_LIST[@]} -gt 0 ]; then
            # Randomly select a file to delete
            RANDOM_FILE=${FILE_LIST[$RANDOM % ${#FILE_LIST[@]}]}
            rm -f "$DIR/$RANDOM_FILE"
            echo "Deleted file: $DIR/$RANDOM_FILE"
        fi
    done

    echo "Completed run $run"
done

echo "Test completed!"


> 
>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>> index 5d560e9073f4..b4e95b2dd0f0 100644
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
>> @@ -319,16 +319,17 @@ xfs_buf_alloc_pages(
>>  	 * least one extra page.
>>  	 */
>>  	for (;;) {
>> -		long	last = filled;
>> +		long	alloc;
>>  
>> -		filled = alloc_pages_bulk(gfp_mask, bp->b_page_count,
>> -					  bp->b_pages);
>> +		alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - filled,
>> +					 bp->b_pages + filled);
>> +		filled += alloc;
>>  		if (filled == bp->b_page_count) {
>>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>>  			break;
>>  		}
>>  
>> -		if (filled != last)
>> +		if (alloc)
>>  			continue;
> 
> alloc_pages_bulk() now returns the number of pages allocated in the
> array. So if we ask for 4 pages, then get 2, filled is now 2. Then
> we loop, ask for another 2 pages, get those two pages and it returns
> 4. Now filled is 6, and we continue.

It will return 2 instead of 4 on the second loop if I understand
it correctly, as 'bp->b_pages + filled' and 'bp->b_page_count - filled'
are now passed to the alloc_pages_bulk() API.
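
For example, assuming bp->b_page_count is 4 and the first call only gets
2 pages (this is my understanding of the changed semantics, where the
return value only counts the pages allocated by that particular call):

  1st call: alloc_pages_bulk(gfp_mask, 4 - 0, bp->b_pages + 0) returns 2, filled = 2
  2nd call: alloc_pages_bulk(gfp_mask, 4 - 2, bp->b_pages + 2) returns 2, filled = 4

so the loop only breaks out once all four slots are populated.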

> 
> Now we ask alloc_pages_bulk() for -2 pages, which returns 4 pages...
> 
> Worse behaviour: second time around, no page allocation succeeds
> so it returns 2 pages. Filled is now 4, which is the number of pages
> we need, so we break out of the loop with only 2 pages allocated.
> There's about to be kernel crashes occur.....
> 
> Once is a mistake, twice is compeltely unacceptable.  When XFS stops
> using alloc_pages_bulk (probably 6.15) I won't care anymore. But
> until then, please stop trying to change this code.
> 
> NACK.
> 
> -Dave.
Yunsheng Lin March 5, 2025, 12:17 p.m. UTC | #6
On 2025/3/4 17:17, Qu Wenruo wrote:
> 
> 
> On 2025/2/28 20:14, Yunsheng Lin wrote:
>> As mentioned in [1], it seems odd to check NULL elements in
>> the middle of page bulk allocating, and it seems caller can
>> do a better job of bulk allocating pages into a whole array
>> sequentially without checking NULL elements first before
>> doing the page bulk allocation for most of existing users.
>>
>> Through analyzing of bulk allocation API used in fs, it
>> seems that the callers are depending on the assumption of
>> populating only NULL elements in fs/btrfs/extent_io.c and
>> net/sunrpc/svc_xprt.c while erofs and btrfs don't, see:

I should have said 'while erofs and xfs don't depend on the
assumption of populating only NULL elements'.

>> commit 91d6ac1d62c3 ("btrfs: allocate page arrays using bulk page allocator")
> 
> If you want to change the btrfs part, please run full fstests with SCRATCH_DEV_POOL populated at least.

The above is a helpful suggestion for someone like me who is not very
familiar with fs yet, thanks.

But I am not sure about some of the other comments below.

> 
> [...]
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index f0a1da40d641..ef52cedd9873 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -623,13 +623,26 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
>>                  bool nofail)
>>   {
>>       const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
>> -    unsigned int allocated;
>> +    unsigned int allocated, ret;
>>   -    for (allocated = 0; allocated < nr_pages;) {
>> -        unsigned int last = allocated;
>> +    /* Defragment page_array so pages can be bulk allocated into remaining
>> +     * NULL elements sequentially.
>> +     */
>> +    for (allocated = 0, ret = 0; ret < nr_pages; ret++) {
>> +        if (page_array[ret]) {
> 
> You just prove how bad the design is.
> 

Is the below the reason you think the design is bad? If not, it would be
good to be more specific about why it is a bad design.

> All the callers have their page array members to initialized to NULL, or do not care and just want alloc_pages_bulk() to overwrite the uninitialized values.

Actually there are two use cases here as mentioned in the commit log:
1. Allocating an array of pages sequentially by providing an array as
   an output parameter.
2. Refilling NULL elements of an array with pages by providing the array
   as both an input and an output parameter.

Most users calling the bulk alloc API are allocating an array of pages
sequentially, except btrfs and sunrpc; the current page bulk alloc API
implementation not only does unnecessary NULL checking for most users,
but also requires most of its callers to pass an all-NULL array via
memset, kzalloc, etc., which is also unnecessary overhead.

That means there is some room for improvement, from both a performance
and an ease-of-use perspective, for most existing use cases here; this
patch just changes the alloc_pages_bulk() API to treat page_array as
only an output parameter, mirroring the kmem_cache_alloc_bulk() API.
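
As a minimal sketch of the kmem_cache_alloc_bulk() model being mirrored
(grab_objects() is just an illustrative name), the objects array is
treated purely as output and does not need to be zeroed first:

static int grab_objects(struct kmem_cache *cache, void **objs, size_t nr)
{
	/* objs is output-only: no memset(objs, 0, ...) is needed beforehand */
	return kmem_cache_alloc_bulk(cache, GFP_KERNEL, nr, objs);
}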

For the existing btrfs and sunrpc cases, I agree that there
might be valid use cases too; we just need to discuss how to
meet the requirements of the different use cases using simpler, more
unified and effective APIs.

> 
> The best example here is btrfs_encoded_read_regular().
> Now your code will just crash encoded read.

It would be good to be more specific about the 'crash' here,
as the simple testing mentioned below seems fine for btrfs too,
but I will do more testing by running full fstests with
SCRATCH_DEV_POOL populated after I learn how to use fstests.

https://lore.kernel.org/all/91fcdfca-3e7b-417c-ab26-7d5e37853431@huawei.com/

> 
> Read the context before doing stupid things.
> 
> I find it unacceptable that you just change the code, without any testing, nor even just check all the involved callers.

What exactly is the above 'context' referring to? If it is good advice,
I will take it seriously.

May I suggest that it might be better to be more humble and discuss
more before jumping to conclusions here, as it seems hard for one
person to be familiar with every subsystem in the kernel?

> 
>> +            page_array[allocated] = page_array[ret];
>> +            if (ret != allocated)
>> +                page_array[ret] = NULL;
>> +
>> +            allocated++;
>> +        }
>> +    }
>>   -        allocated = alloc_pages_bulk(gfp, nr_pages, page_array);
>> -        if (unlikely(allocated == last)) {
>> +    while (allocated < nr_pages) {
>> +        ret = alloc_pages_bulk(gfp, nr_pages - allocated,
>> +                       page_array + allocated);
> 
> I see the new interface way worse than the existing one.
> 
> All btrfs usage only wants a simple retry-until-all-fulfilled behavior.

As above, I agree that this might be what the btrfs usage wants, so
let's discuss how to meet the requirements of the different use cases
using a simpler, more unified and effective API, like introducing a
function such as alloc_pages_refill_array() to meet the above
requirement, as mentioned below:
https://lore.kernel.org/all/74827bc7-ec6e-4e3a-9d19-61c4a9ba6b2c@huawei.com/

> 
> NACK for btrfs part, and I find you very unresponsible not even bother running any testsuit and just submit such a mess.
> 
> Just stop this, no one will ever take you serious anymore.
> 
> Thanks,
> Qu
> 
>
NeilBrown March 5, 2025, 11:41 p.m. UTC | #7
On Wed, 05 Mar 2025, Yunsheng Lin wrote:
> 
> For the existing btrfs and sunrpc case, I am agreed that there
> might be valid use cases too, we just need to discuss how to
> meet the requirements of different use cases using simpler, more
> unified and effective APIs.

We don't need "more unified".

If there are genuinely two different use cases with clearly different
needs - even if only slightly different - then it is acceptable to have
two different interfaces.  Be sure to choose names which emphasise the
differences.

Thanks,
NeilBrown

Patch

diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
index ba92bb4e9af9..9f003a237dec 100644
--- a/drivers/vfio/pci/virtio/migrate.c
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -91,8 +91,6 @@  static int virtiovf_add_migration_pages(struct virtiovf_data_buffer *buf,
 		if (ret)
 			goto err_append;
 		buf->allocated_length += filled * PAGE_SIZE;
-		/* clean input for another bulk allocation */
-		memset(page_list, 0, filled * sizeof(*page_list));
 		to_fill = min_t(unsigned int, to_alloc,
 				PAGE_SIZE / sizeof(*page_list));
 	} while (to_alloc > 0);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f0a1da40d641..ef52cedd9873 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -623,13 +623,26 @@  int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
 			   bool nofail)
 {
 	const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
-	unsigned int allocated;
+	unsigned int allocated, ret;
 
-	for (allocated = 0; allocated < nr_pages;) {
-		unsigned int last = allocated;
+	/* Defragment page_array so pages can be bulk allocated into remaining
+	 * NULL elements sequentially.
+	 */
+	for (allocated = 0, ret = 0; ret < nr_pages; ret++) {
+		if (page_array[ret]) {
+			page_array[allocated] = page_array[ret];
+			if (ret != allocated)
+				page_array[ret] = NULL;
+
+			allocated++;
+		}
+	}
 
-		allocated = alloc_pages_bulk(gfp, nr_pages, page_array);
-		if (unlikely(allocated == last)) {
+	while (allocated < nr_pages) {
+		ret = alloc_pages_bulk(gfp, nr_pages - allocated,
+				       page_array + allocated);
+		allocated += ret;
+		if (unlikely(!ret)) {
 			/* No progress, fail and do cleanup. */
 			for (int i = 0; i < allocated; i++) {
 				__free_page(page_array[i]);
diff --git a/fs/erofs/zutil.c b/fs/erofs/zutil.c
index 55ff2ab5128e..1c50b5e27371 100644
--- a/fs/erofs/zutil.c
+++ b/fs/erofs/zutil.c
@@ -85,13 +85,13 @@  int z_erofs_gbuf_growsize(unsigned int nrpages)
 
 		for (j = 0; j < gbuf->nrpages; ++j)
 			tmp_pages[j] = gbuf->pages[j];
-		do {
-			last = j;
-			j = alloc_pages_bulk(GFP_KERNEL, nrpages,
-					     tmp_pages);
-			if (last == j)
+
+		for (last = j; last < nrpages; last += j) {
+			j = alloc_pages_bulk(GFP_KERNEL, nrpages - last,
+					     tmp_pages + last);
+			if (!j)
 				goto out;
-		} while (j != nrpages);
+		}
 
 		ptr = vmap(tmp_pages, nrpages, VM_MAP, PAGE_KERNEL);
 		if (!ptr)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5d560e9073f4..b4e95b2dd0f0 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -319,16 +319,17 @@  xfs_buf_alloc_pages(
 	 * least one extra page.
 	 */
 	for (;;) {
-		long	last = filled;
+		long	alloc;
 
-		filled = alloc_pages_bulk(gfp_mask, bp->b_page_count,
-					  bp->b_pages);
+		alloc = alloc_pages_bulk(gfp_mask, bp->b_page_count - filled,
+					 bp->b_pages + filled);
+		filled += alloc;
 		if (filled == bp->b_page_count) {
 			XFS_STATS_INC(bp->b_mount, xb_page_found);
 			break;
 		}
 
-		if (filled != last)
+		if (alloc)
 			continue;
 
 		if (flags & XBF_READ_AHEAD) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f07c95eb5ac1..625d14ee4a41 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4599,9 +4599,6 @@  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
  * This is a batched version of the page allocator that attempts to
  * allocate nr_pages quickly. Pages are added to the page_array.
  *
- * Note that only NULL elements are populated with pages and nr_pages
- * is the maximum number of pages that will be stored in the array.
- *
  * Returns the number of pages in the array.
  */
 unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
@@ -4617,29 +4614,18 @@  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	struct alloc_context ac;
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	int nr_populated = 0, nr_account = 0;
-
-	/*
-	 * Skip populated array elements to determine if any pages need
-	 * to be allocated before disabling IRQs.
-	 */
-	while (nr_populated < nr_pages && page_array[nr_populated])
-		nr_populated++;
+	int nr_populated = 0;
 
 	/* No pages requested? */
 	if (unlikely(nr_pages <= 0))
 		goto out;
 
-	/* Already populated array? */
-	if (unlikely(nr_pages - nr_populated == 0))
-		goto out;
-
 	/* Bulk allocator does not support memcg accounting. */
 	if (memcg_kmem_online() && (gfp & __GFP_ACCOUNT))
 		goto failed;
 
 	/* Use the single page allocator for one page. */
-	if (nr_pages - nr_populated == 1)
+	if (nr_pages == 1)
 		goto failed;
 
 #ifdef CONFIG_PAGE_OWNER
@@ -4711,24 +4697,16 @@  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	/* Attempt the batch allocation */
 	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
 	while (nr_populated < nr_pages) {
-
-		/* Skip existing pages */
-		if (page_array[nr_populated]) {
-			nr_populated++;
-			continue;
-		}
-
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
-			if (!nr_account) {
+			if (!nr_populated) {
 				pcp_spin_unlock(pcp);
 				goto failed_irq;
 			}
 			break;
 		}
-		nr_account++;
 
 		prep_new_page(page, 0, gfp, 0);
 		set_page_refcounted(page);
@@ -4738,8 +4716,8 @@  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	pcp_spin_unlock(pcp);
 	pcp_trylock_finish(UP_flags);
 
-	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
-	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
+	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_populated);
+	zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_populated);
 
 out:
 	return nr_populated;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddc..200b99375cb6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -544,9 +544,6 @@  static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (unlikely(pool->alloc.count > 0))
 		return pool->alloc.cache[--pool->alloc.count];
 
-	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk */
-	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
-
 	nr_pages = alloc_pages_bulk_node(gfp, pool->p.nid, bulk,
 					 (struct page **)pool->alloc.cache);
 	if (unlikely(!nr_pages))
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ae25405d8bd2..80fbc4ffef6d 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -663,9 +663,23 @@  static bool svc_alloc_arg(struct svc_rqst *rqstp)
 		pages = RPCSVC_MAXPAGES;
 	}
 
-	for (filled = 0; filled < pages; filled = ret) {
-		ret = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
-		if (ret > filled)
+	/* Defragment the rqstp->rq_pages so pages can be bulk allocated into
+	 * remaining NULL elements sequentially.
+	 */
+	for (filled = 0, ret = 0; ret < pages; ret++) {
+		if (rqstp->rq_pages[ret]) {
+			rqstp->rq_pages[filled] = rqstp->rq_pages[ret];
+			if (ret != filled)
+				rqstp->rq_pages[ret] = NULL;
+
+			filled++;
+		}
+	}
+
+	for (; filled < pages; filled += ret) {
+		ret = alloc_pages_bulk(GFP_KERNEL, pages - filled,
+				       rqstp->rq_pages + filled);
+		if (ret)
 			/* Made progress, don't sleep yet */
 			continue;
 
@@ -674,7 +688,7 @@  static bool svc_alloc_arg(struct svc_rqst *rqstp)
 			set_current_state(TASK_RUNNING);
 			return false;
 		}
-		trace_svc_alloc_arg_err(pages, ret);
+		trace_svc_alloc_arg_err(pages, filled);
 		memalloc_retry_wait(GFP_KERNEL);
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];