Message ID | 20210322091845.16437-1-mgorman@techsingularity.net (mailing list archive)
---|---
Series | Introduce a bulk order-0 page allocator
On Mon, 22 Mar 2021 09:18:42 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9

page_bench04_bulk[1] micro-benchmark on branch: mm-bulk-rebase-v5r9

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size
 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

Compared to the previous results (see below), the list variant got faster,
but the array variant is still faster. The array variant lost a little
performance; I think this may be related to the stats counters being
added/moved inside the loop in this patchset.

Previous results from:
 https://lore.kernel.org/netdev/20210319181031.44dd3113@carbon/

On Fri, 19 Mar 2021 18:10:31 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> BASELINE
>  single_page alloc+put: 207 cycles(tsc) 57.773 ns
>
> LIST variant: time_bulk_page_alloc_free_list: step=bulk size
>  Per elem: 294 cycles(tsc) 81.866 ns (step:1)
>  Per elem: 214 cycles(tsc) 59.477 ns (step:2)
>  Per elem: 199 cycles(tsc) 55.504 ns (step:3)
>  Per elem: 192 cycles(tsc) 53.489 ns (step:4)
>  Per elem: 188 cycles(tsc) 52.456 ns (step:8)
>  Per elem: 184 cycles(tsc) 51.346 ns (step:16)
>  Per elem: 183 cycles(tsc) 50.883 ns (step:32)
>  Per elem: 184 cycles(tsc) 51.236 ns (step:64)
>  Per elem: 189 cycles(tsc) 52.620 ns (step:128)
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>  Per elem: 195 cycles(tsc) 54.174 ns (step:1)
>  Per elem: 123 cycles(tsc) 34.224 ns (step:2)
>  Per elem: 113 cycles(tsc) 31.430 ns (step:3)
>  Per elem: 108 cycles(tsc) 30.003 ns (step:4)
>  Per elem: 102 cycles(tsc) 28.417 ns (step:8)
>  Per elem: 98 cycles(tsc) 27.475 ns (step:16)
>  Per elem: 96 cycles(tsc) 26.901 ns (step:32)
>  Per elem: 95 cycles(tsc) 26.487 ns (step:64)
>  Per elem: 94 cycles(tsc) 26.170 ns (step:128)

> The users of the API have been dropped in this version as the callers
> need to check whether they prefer an array or list interface (whether
> preference is based on convenience or performance).

I'll start coding up the page_pool API user and benchmark that.
> Changelog since v4
> o Drop users of the API
> o Remove free_pages_bulk interface, no users

In the [1] benchmark I just open-coded free_pages_bulk():
 https://github.com/netoptimizer/prototype-kernel/commit/49d224b19850b767c

> o Add array interface
> o Allocate single page if watermark checks on local zones fail
On Mon, Mar 22, 2021 at 01:04:46PM +0100, Jesper Dangaard Brouer wrote: > On Mon, 22 Mar 2021 09:18:42 +0000 > Mel Gorman <mgorman@techsingularity.net> wrote: > > > This series is based on top of Matthew Wilcox's series "Rationalise > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to > > test and are not using Andrew's tree as a baseline, I suggest using the > > following git tree > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9 > > page_bench04_bulk[1] micro-benchmark on branch: mm-bulk-rebase-v5r9 > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c > > BASELINE > single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns > > LIST variant: time_bulk_page_alloc_free_list: step=bulk size > > Per elem: 206 cycles(tsc) 57.478 ns (step:1) > Per elem: 154 cycles(tsc) 42.861 ns (step:2) > Per elem: 145 cycles(tsc) 40.536 ns (step:3) > Per elem: 142 cycles(tsc) 39.477 ns (step:4) > Per elem: 142 cycles(tsc) 39.610 ns (step:8) > Per elem: 137 cycles(tsc) 38.155 ns (step:16) > Per elem: 135 cycles(tsc) 37.739 ns (step:32) > Per elem: 134 cycles(tsc) 37.282 ns (step:64) > Per elem: 133 cycles(tsc) 36.993 ns (step:128) > > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size > > Per elem: 202 cycles(tsc) 56.383 ns (step:1) > Per elem: 144 cycles(tsc) 40.047 ns (step:2) > Per elem: 134 cycles(tsc) 37.339 ns (step:3) > Per elem: 128 cycles(tsc) 35.578 ns (step:4) > Per elem: 120 cycles(tsc) 33.592 ns (step:8) > Per elem: 116 cycles(tsc) 32.362 ns (step:16) > Per elem: 113 cycles(tsc) 31.476 ns (step:32) > Per elem: 110 cycles(tsc) 30.633 ns (step:64) > Per elem: 110 cycles(tsc) 30.596 ns (step:128) > > Compared to the previous results (see below) list-variant got faster, > but array-variant is still faster. The array variant lost a little > performance. I think this can be related to the stats counters got > added/moved inside the loop, in this patchset. > If you are feeling particularly brave, take a look at git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r10 It's a prototype series rebased on top of the bulk allocator and this version has not even been boot tested. While it'll get rough treatment during review, it should reduce the cost of the stat updates in the bulk allocator as a side-effect.
> On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote: > > This series is based on top of Matthew Wilcox's series "Rationalise > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to > test and are not using Andrew's tree as a baseline, I suggest using the > following git tree > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9 > > The users of the API have been dropped in this version as the callers > need to check whether they prefer an array or list interface (whether > preference is based on convenience or performance). I now have a consumer implementation that uses the array API. If I understand the contract correctly, the return value is the last array index that __alloc_pages_bulk() visits. My consumer uses the return value to determine if it needs to call the allocator again. It is returning some confusing (to me) results. I'd like to get these resolved before posting any benchmark results. 1. When it has visited every array element, it returns the same value as was passed in @nr_pages. That's the N + 1th array element, which shouldn't be touched. Should the allocator return nr_pages - 1 in the fully successful case? Or should the documentation describe the return value as "the number of elements visited" ? 2. Frequently the allocator returns a number smaller than the total number of elements. As you may recall, sunrpc will delay a bit (via a call to schedule_timeout) then call again. This is supposed to be a rare event, and the delay is substantial. But with the array-based API, a not-fully- successful allocator call seems to happen more than half the time. Is that expected? I'm calling with GFP_KERNEL, seems like the allocator should be trying harder. 3. Is the current design intended so that if the consumer does call again, is it supposed to pass in the array address + the returned index (and @nr_pages reduced by the returned index) ? Thanks for all your hard work, Mel. > Changelog since v4 > o Drop users of the API > o Remove free_pages_bulk interface, no users > o Add array interface > o Allocate single page if watermark checks on local zones fail > > Changelog since v3 > o Rebase on top of Matthew's series consolidating the alloc_pages API > o Rename alloced to allocated > o Split out preparation patch for prepare_alloc_pages > o Defensive check for bulk allocation or <= 0 pages > o Call single page allocation path only if no pages were allocated > o Minor cosmetic cleanups > o Reorder patch dependencies by subsystem. As this is a cross-subsystem > series, the mm patches have to be merged before the sunrpc and net > users. > > Changelog since v2 > o Prep new pages with IRQs enabled > o Minor documentation update > > Changelog since v1 > o Parenthesise binary and boolean comparisons > o Add reviewed-bys > o Rebase to 5.12-rc2 > > This series introduces a bulk order-0 page allocator with the > intent that sunrpc and the network page pool become the first users. > The implementation is not particularly efficient and the intention is to > iron out what the semantics of the API should have for users. Despite > that, this is a performance-related enhancement for users that require > multiple pages for an operation without multiple round-trips to the page > allocator. Quoting the last patch for the prototype high-speed networking > use-case. 
> > For XDP-redirect workload with 100G mlx5 driver (that use page_pool) > redirecting xdp_frame packets into a veth, that does XDP_PASS to > create an SKB from the xdp_frame, which then cannot return the page > to the page_pool. In this case, we saw[1] an improvement of 18.8% > from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps). > > Both potential users in this series are corner cases (NFS and high-speed > networks) so it is unlikely that most users will see any benefit in the > short term. Other potential other users are batch allocations for page > cache readahead, fault around and SLUB allocations when high-order pages > are unavailable. It's unknown how much benefit would be seen by converting > multiple page allocation calls to a single batch or what difference it may > make to headline performance. It's a chicken and egg problem given that > the potential benefit cannot be investigated without an implementation > to test against. > > Light testing passed, I'm relying on Chuck and Jesper to test their > implementations, choose whether to use lists or arrays and document > performance gains/losses in the changelogs. > > Patch 1 renames a variable name that is particularly unpopular > > Patch 2 adds a bulk page allocator > > Patch 3 adds an array-based version of the bulk allocator > > include/linux/gfp.h | 18 +++++ > mm/page_alloc.c | 171 ++++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 185 insertions(+), 4 deletions(-) > > -- > 2.26.2 > -- Chuck Lever
On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote: > > > > On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > This series is based on top of Matthew Wilcox's series "Rationalise > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to > > test and are not using Andrew's tree as a baseline, I suggest using the > > following git tree > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9 > > > > The users of the API have been dropped in this version as the callers > > need to check whether they prefer an array or list interface (whether > > preference is based on convenience or performance). > > I now have a consumer implementation that uses the array > API. If I understand the contract correctly, the return > value is the last array index that __alloc_pages_bulk() > visits. My consumer uses the return value to determine > if it needs to call the allocator again. > For either arrays or lists, the return value is the number of valid pages. For arrays, the pattern is expected to be nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array); for (i = 0; i < nr_pages; i++) { do something with page_array[i] } There *could* be populated valid elements on and after nr_pages but the implementation did not visit those elements. The implementation can abort early if the array looks like this PPP....PPP Where P is a page and . is NULL. The implementation would skip the first three pages, allocate four pages and then abort when a new page was encountered. This is an implementation detail around how I handled prep_new_page. It could be addressed if many callers expect to pass in an array that has holes in the middle. > It is returning some confusing (to me) results. I'd like > to get these resolved before posting any benchmark > results. > > 1. When it has visited every array element, it returns the > same value as was passed in @nr_pages. That's the N + 1th > array element, which shouldn't be touched. Should the > allocator return nr_pages - 1 in the fully successful case? > Or should the documentation describe the return value as > "the number of elements visited" ? > I phrased it as "the known number of populated elements in the page_array". I did not want to write it as "the number of valid elements in the array" because that is not necessarily the case if an array is passed in with holes in the middle. I'm open to any suggestions on how the __alloc_pages_bulk description can be improved. The definition of the return value as-is makes sense for either a list or an array. Returning "nr_pages - 1" suits an array because it's the last valid index but it makes less sense when returning a list. > 2. Frequently the allocator returns a number smaller than > the total number of elements. As you may recall, sunrpc > will delay a bit (via a call to schedule_timeout) then call > again. This is supposed to be a rare event, and the delay > is substantial. But with the array-based API, a not-fully- > successful allocator call seems to happen more than half > the time. Is that expected? I'm calling with GFP_KERNEL, > seems like the allocator should be trying harder. > It's not expected that the array implementation would be worse *unless* you are passing in arrays with holes in the middle. Otherwise, the success rate should be similar. > 3. 
Is the current design intended so that if the consumer
> does call again, is it supposed to pass in the array address
> + the returned index (and @nr_pages reduced by the returned
> index) ?
>

The caller does not have to pass in array address + returned index but
it's more efficient if it does.

If you are passing in arrays with holes in the middle then the following
might work (not tested)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4dc38516a5bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags;
 	int nr_populated = 0, prep_index = 0;
+	bool hole = false;
 
 	if (WARN_ON_ONCE(nr_pages <= 0))
 		return 0;
@@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (!zone)
 		goto failed;
 
+retry_hole:
 	/* Attempt the batch allocation */
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
@@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		 * IRQs are enabled.
 		 */
 		if (page_array && page_array[nr_populated]) {
+			hole = true;
 			nr_populated++;
 			break;
 		}
@@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
 	}
 
+	if (hole && nr_populated < nr_pages && hole)
+		goto retry_hole;
+
 	return nr_populated;
 
 failed_irq:
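To make the calling convention above concrete, a consumer of the array
interface could look roughly like the sketch below. This is only an
illustration of the contract Mel describes (NULL elements are filled in and
the return value is the count of populated elements the allocator knows
about); the alloc_pages_bulk_array() wrapper name is borrowed from later in
the thread, and the array size and retry back-off are arbitrary.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/*
 * Hedged sketch only: assumes the alloc_pages_bulk_array() wrapper used
 * later in this thread.  NULL elements of @pages are filled and the
 * return value is the count of populated elements; a partial result is
 * handled by backing off and retrying.  Signal handling is omitted.
 */
static int fill_page_array_example(struct page **pages, int want)
{
	int have;

	for (;;) {
		have = alloc_pages_bulk_array(GFP_KERNEL, want, pages);
		if (have >= want)
			return 0;
		/* Partial result: back off briefly before retrying. */
		schedule_timeout_uninterruptible(msecs_to_jiffies(100));
	}
}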
> On Mar 22, 2021, at 3:49 PM, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote: >> >> >>> On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote: >>> >>> This series is based on top of Matthew Wilcox's series "Rationalise >>> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to >>> test and are not using Andrew's tree as a baseline, I suggest using the >>> following git tree >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9 >>> >>> The users of the API have been dropped in this version as the callers >>> need to check whether they prefer an array or list interface (whether >>> preference is based on convenience or performance). >> >> I now have a consumer implementation that uses the array >> API. If I understand the contract correctly, the return >> value is the last array index that __alloc_pages_bulk() >> visits. My consumer uses the return value to determine >> if it needs to call the allocator again. >> > > For either arrays or lists, the return value is the number of valid > pages. For arrays, the pattern is expected to be > > nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array); > for (i = 0; i < nr_pages; i++) { > do something with page_array[i] > } > > There *could* be populated valid elements on and after nr_pages but the > implementation did not visit those elements. The implementation can abort > early if the array looks like this > > PPP....PPP > > Where P is a page and . is NULL. The implementation would skip the > first three pages, allocate four pages and then abort when a new page > was encountered. This is an implementation detail around how I handled > prep_new_page. It could be addressed if many callers expect to pass in > an array that has holes in the middle. > >> It is returning some confusing (to me) results. I'd like >> to get these resolved before posting any benchmark >> results. >> >> 1. When it has visited every array element, it returns the >> same value as was passed in @nr_pages. That's the N + 1th >> array element, which shouldn't be touched. Should the >> allocator return nr_pages - 1 in the fully successful case? >> Or should the documentation describe the return value as >> "the number of elements visited" ? >> > > I phrased it as "the known number of populated elements in the > page_array". The comment you added states: + * For lists, nr_pages is the number of pages that should be allocated. + * + * For arrays, only NULL elements are populated with pages and nr_pages + * is the maximum number of pages that will be stored in the array. + * + * Returns the number of pages added to the page_list or the index of the + * last known populated element of page_array. > I did not want to write it as "the number of valid elements > in the array" because that is not necessarily the case if an array is > passed in with holes in the middle. I'm open to any suggestions on how > the __alloc_pages_bulk description can be improved. The comments states that, for the array case, a /count/ of pages is passed in, and an /index/ is returned. If you want to return the same type for lists and arrays, it should be documented as a count in both cases, to match @nr_pages. Consumers will want to compare @nr_pages with the return value to see if they need to call again. Comparing a count to an index is a notorious source of off-by-one errors. > The definition of the return value as-is makes sense for either a list > or an array. 
Returning "nr_pages - 1" suits an array because it's the > last valid index but it makes less sense when returning a list. > >> 2. Frequently the allocator returns a number smaller than >> the total number of elements. As you may recall, sunrpc >> will delay a bit (via a call to schedule_timeout) then call >> again. This is supposed to be a rare event, and the delay >> is substantial. But with the array-based API, a not-fully- >> successful allocator call seems to happen more than half >> the time. Is that expected? I'm calling with GFP_KERNEL, >> seems like the allocator should be trying harder. >> > > It's not expected that the array implementation would be worse *unless* > you are passing in arrays with holes in the middle. Otherwise, the success > rate should be similar. Essentially, sunrpc will always pass an array with a hole. Each RPC consumes the first N elements in the rq_pages array. Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not pass in an array with more than one hole. Typically: .....PPPP My results show that, because svc_alloc_arg() ends up calling __alloc_pages_bulk() twice in this case, it ends up being twice as expensive as the list case, on average, for the same workload. >> 3. Is the current design intended so that if the consumer >> does call again, is it supposed to pass in the array address >> + the returned index (and @nr_pages reduced by the returned >> index) ? >> > > The caller does not have to pass in array address + returned index but > it's more efficient if it does. > > If you are passing in arrays with holes in the middle then the following > might work (not tested) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c83d38dfe936..4dc38516a5bd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > gfp_t alloc_gfp; > unsigned int alloc_flags; > int nr_populated = 0, prep_index = 0; > + bool hole = false; > > if (WARN_ON_ONCE(nr_pages <= 0)) > return 0; > @@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > if (!zone) > goto failed; > > +retry_hole: > /* Attempt the batch allocation */ > local_irq_save(flags); > pcp = &this_cpu_ptr(zone->pageset)->pcp; > @@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > * IRQs are enabled. > */ > if (page_array && page_array[nr_populated]) { > + hole = true; > nr_populated++; > break; > } > @@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > prep_new_page(page_array[prep_index++], 0, gfp, 0); > } > > + if (hole && nr_populated < nr_pages && hole) > + goto retry_hole; > + > return nr_populated; > > failed_irq: > > -- > Mel Gorman > SUSE Labs If a local_irq_save() is done more than once in this case, I don't expect that the result will be much better. To make the array API as performant as the list API, the sunrpc consumer will have to check if the N + 1th element is populated, upon return, rather than checking the return value against @nr_pages. -- Chuck Lever
On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote: > >> It is returning some confusing (to me) results. I'd like > >> to get these resolved before posting any benchmark > >> results. > >> > >> 1. When it has visited every array element, it returns the > >> same value as was passed in @nr_pages. That's the N + 1th > >> array element, which shouldn't be touched. Should the > >> allocator return nr_pages - 1 in the fully successful case? > >> Or should the documentation describe the return value as > >> "the number of elements visited" ? > >> > > > > I phrased it as "the known number of populated elements in the > > page_array". > > The comment you added states: > > + * For lists, nr_pages is the number of pages that should be allocated. > + * > + * For arrays, only NULL elements are populated with pages and nr_pages > + * is the maximum number of pages that will be stored in the array. > + * > + * Returns the number of pages added to the page_list or the index of the > + * last known populated element of page_array. > > > > I did not want to write it as "the number of valid elements > > in the array" because that is not necessarily the case if an array is > > passed in with holes in the middle. I'm open to any suggestions on how > > the __alloc_pages_bulk description can be improved. > > The comments states that, for the array case, a /count/ of > pages is passed in, and an /index/ is returned. If you want > to return the same type for lists and arrays, it should be > documented as a count in both cases, to match @nr_pages. > Consumers will want to compare @nr_pages with the return > value to see if they need to call again. > Then I'll just say it's the known count of pages in the array. That might still be less than the number of requested pages if holes are encountered. > > The definition of the return value as-is makes sense for either a list > > or an array. Returning "nr_pages - 1" suits an array because it's the > > last valid index but it makes less sense when returning a list. > > > >> 2. Frequently the allocator returns a number smaller than > >> the total number of elements. As you may recall, sunrpc > >> will delay a bit (via a call to schedule_timeout) then call > >> again. This is supposed to be a rare event, and the delay > >> is substantial. But with the array-based API, a not-fully- > >> successful allocator call seems to happen more than half > >> the time. Is that expected? I'm calling with GFP_KERNEL, > >> seems like the allocator should be trying harder. > >> > > > > It's not expected that the array implementation would be worse *unless* > > you are passing in arrays with holes in the middle. Otherwise, the success > > rate should be similar. > > Essentially, sunrpc will always pass an array with a hole. > Each RPC consumes the first N elements in the rq_pages array. > Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not > pass in an array with more than one hole. Typically: > > .....PPPP > > My results show that, because svc_alloc_arg() ends up calling > __alloc_pages_bulk() twice in this case, it ends up being > twice as expensive as the list case, on average, for the same > workload. > Ok, so in this case the caller knows that holes are always at the start. If the API returns an index that is a valid index and populated, it can check the next index and if it is valid then the whole array must be populated. 
Right now, the implementation checks for populated elements at the *start*
because it is required for calling prep_new_page starting at the correct
index and the API cannot make assumptions about the location of the hole.

The patch below would check the rest of the array but note that it's
slower for the API to do this check because it has to check every element
while the sunrpc user could check one element. Let me know if a) this
hunk helps and b) is desired behaviour.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4bf20650e5f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	} else {
 		while (prep_index < nr_populated)
 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
+
+		while (nr_populated < nr_pages && page_array[nr_populated])
+			nr_populated++;
 	}
 
 	return nr_populated;
On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>

Jesper and Chuck, would you mind rebasing on top of the following branch
please?

 git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

The interface is the same so the rebase should be trivial.

Jesper, I'm hoping you see no differences in performance but it's best
to check.

For Chuck, this version will check for holes and scan the remainder of
the array to see if nr_pages are allocated before returning. If the holes
in the array are always at the start (which should be the case for sunrpc)
then it should still be a single IRQ disable/enable. Specifically, each
contiguous hole in the array will disable/enable IRQs once.

I prototyped NFS array support and it had a 100% success rate with no
sleeps running dbench over the network with no memory pressure, but that's
a basic test on a 10G switch. The basic patch I used to convert sunrpc
from using lists to an array for testing is as follows;

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 922118968986..0ce33c1742d9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,12 +642,10 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
-	unsigned long needed;
 	struct xdr_buf *arg;
-	struct page *page;
 	LIST_HEAD(list);
 	int pages;
-	int i;
+	int i = 0;
 
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
@@ -657,29 +655,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		pages = RPCSVC_MAXPAGES;
 	}
 
-	for (needed = 0, i = 0; i < pages ; i++) {
-		if (!rqstp->rq_pages[i])
-			needed++;
-	}
-	i = 0;
-	while (needed) {
-		needed -= alloc_pages_bulk(GFP_KERNEL, needed, &list);
-		for (; i < pages; i++) {
-			if (rqstp->rq_pages[i])
-				continue;
-			page = list_first_entry_or_null(&list, struct page, lru);
-			if (likely(page)) {
-				list_del(&page->lru);
-				rqstp->rq_pages[i] = page;
-				continue;
-			}
+	while (i < pages) {
+		i = alloc_pages_bulk_array(GFP_KERNEL, pages, &rqstp->rq_pages[0]);
+		if (i < pages) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (signalled() || kthread_should_stop()) {
 				set_current_state(TASK_RUNNING);
 				return -EINTR;
 			}
 			schedule_timeout(msecs_to_jiffies(500));
-			break;
 		}
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
On Mon, 22 Mar 2021 20:58:27 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote: > > >> It is returning some confusing (to me) results. I'd like > > >> to get these resolved before posting any benchmark > > >> results. > > >> > > >> 1. When it has visited every array element, it returns the > > >> same value as was passed in @nr_pages. That's the N + 1th > > >> array element, which shouldn't be touched. Should the > > >> allocator return nr_pages - 1 in the fully successful case? > > >> Or should the documentation describe the return value as > > >> "the number of elements visited" ? > > >> > > > > > > I phrased it as "the known number of populated elements in the > > > page_array". > > > > The comment you added states: > > > > + * For lists, nr_pages is the number of pages that should be allocated. > > + * > > + * For arrays, only NULL elements are populated with pages and nr_pages > > + * is the maximum number of pages that will be stored in the array. > > + * > > + * Returns the number of pages added to the page_list or the index of the > > + * last known populated element of page_array. > > > > > > > I did not want to write it as "the number of valid elements > > > in the array" because that is not necessarily the case if an array is > > > passed in with holes in the middle. I'm open to any suggestions on how > > > the __alloc_pages_bulk description can be improved. > > > > The comments states that, for the array case, a /count/ of > > pages is passed in, and an /index/ is returned. If you want > > to return the same type for lists and arrays, it should be > > documented as a count in both cases, to match @nr_pages. > > Consumers will want to compare @nr_pages with the return > > value to see if they need to call again. > > > > Then I'll just say it's the known count of pages in the array. That > might still be less than the number of requested pages if holes are > encountered. > > > > The definition of the return value as-is makes sense for either a list > > > or an array. Returning "nr_pages - 1" suits an array because it's the > > > last valid index but it makes less sense when returning a list. > > > > > >> 2. Frequently the allocator returns a number smaller than > > >> the total number of elements. As you may recall, sunrpc > > >> will delay a bit (via a call to schedule_timeout) then call > > >> again. This is supposed to be a rare event, and the delay > > >> is substantial. But with the array-based API, a not-fully- > > >> successful allocator call seems to happen more than half > > >> the time. Is that expected? I'm calling with GFP_KERNEL, > > >> seems like the allocator should be trying harder. > > >> > > > > > > It's not expected that the array implementation would be worse *unless* > > > you are passing in arrays with holes in the middle. Otherwise, the success > > > rate should be similar. > > > > Essentially, sunrpc will always pass an array with a hole. > > Each RPC consumes the first N elements in the rq_pages array. > > Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not > > pass in an array with more than one hole. Typically: > > > > .....PPPP > > > > My results show that, because svc_alloc_arg() ends up calling > > __alloc_pages_bulk() twice in this case, it ends up being > > twice as expensive as the list case, on average, for the same > > workload. > > > > Ok, so in this case the caller knows that holes are always at the > start. 
If the API returns an index that is a valid index and populated,
> it can check the next index and if it is valid then the whole array
> must be populated.
>
> Right now, the implementation checks for populated elements at the *start*
> because it is required for calling prep_new_page starting at the correct
> index and the API cannot make assumptions about the location of the hole.
>
> The patch below would check the rest of the array but note that it's
> slower for the API to do this check because it has to check every element
> while the sunrpc user could check one element. Let me know if a) this
> hunk helps and b) is desired behaviour.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4bf20650e5f5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	} else {
> 		while (prep_index < nr_populated)
> 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
> +
> +		while (nr_populated < nr_pages && page_array[nr_populated])
> +			nr_populated++;
> 	}
> 
> 	return nr_populated;

I do know that I suggested moving prep_new_page() out of the
IRQ-disabled loop, but maybe that was a bad idea, for several reasons.

All prep_new_page() does is write into the struct page, unless some
debugging option (like KASAN) is enabled. This cache-line is hot, as the
LRU-list update has just written into it. As the bulk size goes up, as
Matthew pointed out, this cache-line might be pushed into the L2 cache
and then need to be accessed again when prep_new_page() is called.

Another observation is that moving prep_new_page() into the loop reduced
the function size by 253 bytes (which affects the I-cache):

 ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
 add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
 Function                          old     new   delta
 __alloc_pages_bulk               1965    1712    -253
 Total: Before=60799, After=60546, chg -0.42%

Maybe it is better to keep prep_new_page() inside the loop. This also
allows the list and array variants to share the call, and it should
simplify the array variant code.
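In code terms, the structure Jesper is arguing for (and which the v6r4 diff
Mel posts later in the thread adopts) is roughly the following. This is a
condensed, illustrative extract rather than the actual __alloc_pages_bulk()
body; the statistics updates and error paths are omitted.

	/*
	 * Condensed extract (based on the v6r4 diff posted below), not the
	 * real __alloc_pages_bulk() body: each page is prepared inside the
	 * IRQ-disabled loop while its struct page cache-line is still hot,
	 * instead of in a second pass after IRQs are re-enabled.
	 */
	while (nr_populated < nr_pages) {
		/* Skip existing pages when an array with holes is passed in. */
		if (page_array && page_array[nr_populated]) {
			nr_populated++;
			continue;
		}

		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
					 pcp, pcp_list);
		if (unlikely(!page))
			break;

		prep_new_page(page, 0, gfp, 0);	/* cache-line still hot */
		if (page_list)
			list_add(&page->lru, page_list);
		else
			page_array[nr_populated] = page;
		nr_populated++;
	}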
On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> > It's not expected that the array implementation would be worse *unless*
> > you are passing in arrays with holes in the middle. Otherwise, the success
> > rate should be similar.
>
> Essentially, sunrpc will always pass an array with a hole.
> Each RPC consumes the first N elements in the rq_pages array.
> Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> pass in an array with more than one hole. Typically:
>
> .....PPPP
>
> My results show that, because svc_alloc_arg() ends up calling
> __alloc_pages_bulk() twice in this case, it ends up being
> twice as expensive as the list case, on average, for the same
> workload.

Can you call memmove() to shift all the pointers down to be the first N
elements? That prevents creating a situation where we have

PPPPPPPP (consume 6)
......PP (try to allocate 6, only 4 available)
PPPP..PP

instead, you'd do:

PPPPPPPP (consume 6)
PP...... (try to allocate 6, only 4 available)
PPPPPP..

Alternatively, you could consume from the tail of the array instead of
the head. Some CPUs aren't as effective about backwards walks as they
are for forwards walks, but let's keep the pressure on CPU manufacturers
to make better CPUs.
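A minimal sketch of the memmove() approach Matthew describes, assuming a
consumer-side array like sunrpc's rq_pages where the first 'used' entries
have just been consumed; the helper name is hypothetical.

#include <linux/mm.h>
#include <linux/string.h>

/*
 * Hypothetical helper: after the first @used entries of a @total-entry
 * array have been consumed, shift the surviving page pointers to the
 * front so the NULL hole sits at the tail, e.g. "......PP" becomes
 * "PP......".  memmove() handles the overlapping copy; the freed tail
 * slots are then cleared so the bulk allocator sees them as holes.
 */
static void shift_pages_to_front(struct page **pages, int total, int used)
{
	memmove(pages, pages + used, (total - used) * sizeof(*pages));
	memset(pages + (total - used), 0, used * sizeof(*pages));
}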
On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote: > > > <SNIP> > > > My results show that, because svc_alloc_arg() ends up calling > > > __alloc_pages_bulk() twice in this case, it ends up being > > > twice as expensive as the list case, on average, for the same > > > workload. > > > > > > > Ok, so in this case the caller knows that holes are always at the > > start. If the API returns an index that is a valid index and populated, > > it can check the next index and if it is valid then the whole array > > must be populated. > > > > <SNIP> > > I do know that I suggested moving prep_new_page() out of the > IRQ-disabled loop, but maybe was a bad idea, for several reasons. > > All prep_new_page does is to write into struct page, unless some > debugging stuff (like kasan) is enabled. This cache-line is hot as > LRU-list update just wrote into this cache-line. As the bulk size goes > up, as Matthew pointed out, this cache-line might be pushed into > L2-cache, and then need to be accessed again when prep_new_page() is > called. > > Another observation is that moving prep_new_page() into loop reduced > function size with 253 bytes (which affect I-cache). > > ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside > add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253) > Function old new delta > __alloc_pages_bulk 1965 1712 -253 > Total: Before=60799, After=60546, chg -0.42% > > Maybe it is better to keep prep_new_page() inside the loop. This also > allows list vs array variant to share the call. And it should simplify > the array variant code. > I agree. I did not like the level of complexity it incurred for arrays or the fact it required that a list to be empty when alloc_pages_bulk() is called. I thought the concern for calling prep_new_page() with IRQs disabled was a little overblown but did not feel strongly enough to push back on it hard given that we've had problems with IRQs being disabled for long periods before. At worst, at some point in the future we'll have to cap the number of pages that can be requested or enable/disable IRQs every X pages. New candidate git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4 Interface is still the same so a rebase should be trivial. Diff between v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P mm/page_alloc.c | 54 +++++++++++++----------------------------------------- 1 file changed, 13 insertions(+), 41 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 547a84f11310..be1e33a4df39 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, struct alloc_context ac; gfp_t alloc_gfp; unsigned int alloc_flags; - int nr_populated = 0, prep_index = 0; - bool hole = false; + int nr_populated = 0; if (WARN_ON_ONCE(nr_pages <= 0)) return 0; - if (WARN_ON_ONCE(page_list && !list_empty(page_list))) - return 0; - - /* Skip populated array elements. */ - if (page_array) { - while (nr_populated < nr_pages && page_array[nr_populated]) - nr_populated++; - if (nr_populated == nr_pages) - return nr_populated; - prep_index = nr_populated; - } + /* + * Skip populated array elements to determine if any pages need + * to be allocated before disabling IRQs. + */ + while (page_array && page_array[nr_populated] && nr_populated < nr_pages) + nr_populated++; - if (nr_pages == 1) + /* Use the single page allocator for one page. 
*/ + if (nr_pages - nr_populated == 1) goto failed; /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */ @@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, if (!zone) goto failed; -retry_hole: /* Attempt the batch allocation */ local_irq_save(flags); pcp = &this_cpu_ptr(zone->pageset)->pcp; pcp_list = &pcp->lists[ac.migratetype]; while (nr_populated < nr_pages) { - /* - * Stop allocating if the next index has a populated - * page or the page will be prepared a second time when - * IRQs are enabled. - */ + + /* Skip existing pages */ if (page_array && page_array[nr_populated]) { - hole = true; nr_populated++; - break; + continue; } page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, @@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, __count_zid_vm_events(PGALLOC, zone_idx(zone), 1); zone_statistics(ac.preferred_zoneref->zone, zone); + prep_new_page(page, 0, gfp, 0); if (page_list) list_add(&page->lru, page_list); else @@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, local_irq_restore(flags); - /* Prep pages with IRQs enabled. */ - if (page_list) { - list_for_each_entry(page, page_list, lru) - prep_new_page(page, 0, gfp, 0); - } else { - while (prep_index < nr_populated) - prep_new_page(page_array[prep_index++], 0, gfp, 0); - - /* - * If the array is sparse, check whether the array is - * now fully populated. Continue allocations if - * necessary. - */ - while (nr_populated < nr_pages && page_array[nr_populated]) - nr_populated++; - if (hole && nr_populated < nr_pages) - goto retry_hole; - } - return nr_populated; failed_irq:
On Tue, 23 Mar 2021 10:44:21 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> >
>
> Jesper and Chuck, would you mind rebasing on top of the following branch
> please?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
>
> The interface is the same so the rebase should be trivial.
>
> Jesper, I'm hoping you see no differences in performance but it's best
> to check.

I will rebase and check again.

In the current performance tests that I'm running, I observe that the
compiler lays out the code in unfortunate ways, which causes I-cache
performance issues. I wonder if you could integrate the patch below with
your patchset? (just squash it)
On Tue, Mar 23, 2021 at 04:08:14PM +0100, Jesper Dangaard Brouer wrote: > On Tue, 23 Mar 2021 10:44:21 +0000 > Mel Gorman <mgorman@techsingularity.net> wrote: > > > On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote: > > > This series is based on top of Matthew Wilcox's series "Rationalise > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to > > > test and are not using Andrew's tree as a baseline, I suggest using the > > > following git tree > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9 > > > > > > > Jesper and Chuck, would you mind rebasing on top of the following branch > > please? > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2 > > > > The interface is the same so the rebase should be trivial. > > > > Jesper, I'm hoping you see no differences in performance but it's best > > to check. > > I will rebase and check again. > > The current performance tests that I'm running, I observe that the > compiler layout the code in unfortunate ways, which cause I-cache > performance issues. I wonder if you could integrate below patch with > your patchset? (just squash it) > Yes but I'll keep it as a separate patch that is modified slightly. Otherwise it might get "fixed" as likely/unlikely has been used inappropriately in the past. If there is pushback, I'll squash them together. > From: Jesper Dangaard Brouer <brouer@redhat.com> > > Looking at perf-report and ASM-code for __alloc_pages_bulk() then the code > activated is suboptimal. The compiler guess wrong and place unlikely code in > the beginning. Due to the use of WARN_ON_ONCE() macro the UD2 asm > instruction is added to the code, which confuse the I-cache prefetcher in > the CPU > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > --- > mm/page_alloc.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f60f51a97a7b..88a5c1ce5b87 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5003,10 +5003,10 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > unsigned int alloc_flags; > int nr_populated = 0, prep_index = 0; > > - if (WARN_ON_ONCE(nr_pages <= 0)) > + if (unlikely(nr_pages <= 0)) > return 0; > Ok, I can make this change. It was a defensive check for the new callers in case insane values were being passed in. > - if (WARN_ON_ONCE(page_list && !list_empty(page_list))) > + if (unlikely(page_list && !list_empty(page_list))) > return 0; > > /* Skip populated array elements. */ FWIW, this check is now gone. The list only had to be empty if prep_new_page was deferred until IRQs were enabled to avoid accidentally calling prep_new_page() on a page that was already on the list when alloc_pages_bulk was called. > @@ -5018,7 +5018,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > prep_index = nr_populated; > } > > - if (nr_pages == 1) > + if (unlikely(nr_pages == 1)) > goto failed; > > /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */ I'm dropping this because nr_pages == 1 is common for the sunrpc user. > @@ -5054,7 +5054,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > * If there are no allowed local zones that meets the watermarks then > * try to allocate a single page and reclaim if necessary. > */ > - if (!zone) > + if (unlikely(!zone)) > goto failed; > > /* Attempt the batch allocation */ Ok. 
> @@ -5075,7 +5075,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > > page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, > pcp, pcp_list); > - if (!page) { > + if (unlikely(!page)) { > /* Try and get at least one page */ > if (!nr_populated) > goto failed_irq; Hmmm, ok. It depends on memory pressure but I agree !page is unlikely. Current version applied is --8<-- mm/page_alloc: optimize code layout for __alloc_pages_bulk From: Jesper Dangaard Brouer <brouer@redhat.com> Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear that the code activated is suboptimal. The compiler guesses wrong and places unlikely code at the beginning. Due to the use of WARN_ON_ONCE() macro the UD2 asm instruction is added to the code, which confuse the I-cache prefetcher in the CPU. [mgorman: Minor changes and rebasing] Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index be1e33a4df39..1ec18121268b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5001,7 +5001,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, unsigned int alloc_flags; int nr_populated = 0; - if (WARN_ON_ONCE(nr_pages <= 0)) + if (unlikely(nr_pages <= 0)) return 0; /* @@ -5048,7 +5048,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, * If there are no allowed local zones that meets the watermarks then * try to allocate a single page and reclaim if necessary. */ - if (!zone) + if (unlikely(!zone)) goto failed; /* Attempt the batch allocation */ @@ -5066,7 +5066,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, pcp, pcp_list); - if (!page) { + if (unlikely(!page)) { /* Try and get at least one page */ if (!nr_populated) goto failed_irq;
On Tue, 23 Mar 2021 16:08:14 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 23 Mar 2021 10:44:21 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9

I've pushed my benchmark notes for this branch (mm-bulk-rebase-v5r9):

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#test-on-mel-git-tree-mm-bulk-rebase-v5r9

> > Jesper and Chuck, would you mind rebasing on top of the following branch
> > please?
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

I'll rebase on mm-bulk-rebase-v6r4 tomorrow.
> On Mar 23, 2021, at 10:45 AM, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote: >>>> <SNIP> >>>> My results show that, because svc_alloc_arg() ends up calling >>>> __alloc_pages_bulk() twice in this case, it ends up being >>>> twice as expensive as the list case, on average, for the same >>>> workload. >>>> >>> >>> Ok, so in this case the caller knows that holes are always at the >>> start. If the API returns an index that is a valid index and populated, >>> it can check the next index and if it is valid then the whole array >>> must be populated. >>> >>> <SNIP> >> >> I do know that I suggested moving prep_new_page() out of the >> IRQ-disabled loop, but maybe was a bad idea, for several reasons. >> >> All prep_new_page does is to write into struct page, unless some >> debugging stuff (like kasan) is enabled. This cache-line is hot as >> LRU-list update just wrote into this cache-line. As the bulk size goes >> up, as Matthew pointed out, this cache-line might be pushed into >> L2-cache, and then need to be accessed again when prep_new_page() is >> called. >> >> Another observation is that moving prep_new_page() into loop reduced >> function size with 253 bytes (which affect I-cache). >> >> ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside >> add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253) >> Function old new delta >> __alloc_pages_bulk 1965 1712 -253 >> Total: Before=60799, After=60546, chg -0.42% >> >> Maybe it is better to keep prep_new_page() inside the loop. This also >> allows list vs array variant to share the call. And it should simplify >> the array variant code. >> > > I agree. I did not like the level of complexity it incurred for arrays > or the fact it required that a list to be empty when alloc_pages_bulk() > is called. I thought the concern for calling prep_new_page() with IRQs > disabled was a little overblown but did not feel strongly enough to push > back on it hard given that we've had problems with IRQs being disabled > for long periods before. At worst, at some point in the future we'll have > to cap the number of pages that can be requested or enable/disable IRQs > every X pages. > > New candidate > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4 I have rebased the SUNRPC patches to v6r4. Testing has shown a minor functional regression, which I'm chasing down. But performance is in the same ballpark. FYI > Interface is still the same so a rebase should be trivial. Diff between > v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P > > > mm/page_alloc.c | 54 +++++++++++++----------------------------------------- > 1 file changed, 13 insertions(+), 41 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 547a84f11310..be1e33a4df39 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > struct alloc_context ac; > gfp_t alloc_gfp; > unsigned int alloc_flags; > - int nr_populated = 0, prep_index = 0; > - bool hole = false; > + int nr_populated = 0; > > if (WARN_ON_ONCE(nr_pages <= 0)) > return 0; > > - if (WARN_ON_ONCE(page_list && !list_empty(page_list))) > - return 0; > - > - /* Skip populated array elements. 
*/ > - if (page_array) { > - while (nr_populated < nr_pages && page_array[nr_populated]) > - nr_populated++; > - if (nr_populated == nr_pages) > - return nr_populated; > - prep_index = nr_populated; > - } > + /* > + * Skip populated array elements to determine if any pages need > + * to be allocated before disabling IRQs. > + */ > + while (page_array && page_array[nr_populated] && nr_populated < nr_pages) > + nr_populated++; > > - if (nr_pages == 1) > + /* Use the single page allocator for one page. */ > + if (nr_pages - nr_populated == 1) > goto failed; > > /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */ > @@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > if (!zone) > goto failed; > > -retry_hole: > /* Attempt the batch allocation */ > local_irq_save(flags); > pcp = &this_cpu_ptr(zone->pageset)->pcp; > pcp_list = &pcp->lists[ac.migratetype]; > > while (nr_populated < nr_pages) { > - /* > - * Stop allocating if the next index has a populated > - * page or the page will be prepared a second time when > - * IRQs are enabled. > - */ > + > + /* Skip existing pages */ > if (page_array && page_array[nr_populated]) { > - hole = true; > nr_populated++; > - break; > + continue; > } > > page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, > @@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > __count_zid_vm_events(PGALLOC, zone_idx(zone), 1); > zone_statistics(ac.preferred_zoneref->zone, zone); > > + prep_new_page(page, 0, gfp, 0); > if (page_list) > list_add(&page->lru, page_list); > else > @@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > > local_irq_restore(flags); > > - /* Prep pages with IRQs enabled. */ > - if (page_list) { > - list_for_each_entry(page, page_list, lru) > - prep_new_page(page, 0, gfp, 0); > - } else { > - while (prep_index < nr_populated) > - prep_new_page(page_array[prep_index++], 0, gfp, 0); > - > - /* > - * If the array is sparse, check whether the array is > - * now fully populated. Continue allocations if > - * necessary. > - */ > - while (nr_populated < nr_pages && page_array[nr_populated]) > - nr_populated++; > - if (hole && nr_populated < nr_pages) > - goto retry_hole; > - } > - > return nr_populated; > > failed_irq: > > -- > Mel Gorman > SUSE Labs -- Chuck Lever