diff mbox series

[net-next,v12,01/14] mm: page_frag: add a test module for page_frag

Message ID 20240731124505.2903877-2-linyunsheng@huawei.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series Replace page_frag with page_frag_cache for sk_page_frag() | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 42 this patch: 42
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 2 of 2 maintainers
netdev/build_clang success Errors and warnings before: 44 this patch: 44
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api fail Found: 'module_param' was: 0 now: 5
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 49 this patch: 49
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Yunsheng Lin July 31, 2024, 12:44 p.m. UTC
Basing on the lib/objpool.c, change it to something like a
ptrpool, so that we can utilize that to test the correctness
and performance of the page_frag.

The testing is done by ensuring that the fragment allocated
from a frag_frag_cache instance is pushed into a ptrpool
instance in a kthread binded to a specified cpu, and a kthread
binded to a specified cpu will pop the fragment from the
ptrpool and free the fragment.

We may refactor out the common part between objpool and ptrpool
if this ptrpool thing turns out to be helpful for other place.

CC: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 mm/Kconfig          |   8 +
 mm/Makefile         |   1 +
 mm/page_frag_test.c | 393 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 402 insertions(+)
 create mode 100644 mm/page_frag_test.c

Comments

Alexander Duyck July 31, 2024, 6:29 p.m. UTC | #1
On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Basing on the lib/objpool.c, change it to something like a
> ptrpool, so that we can utilize that to test the correctness
> and performance of the page_frag.
>
> The testing is done by ensuring that the fragment allocated
> from a frag_frag_cache instance is pushed into a ptrpool
> instance in a kthread binded to a specified cpu, and a kthread
> binded to a specified cpu will pop the fragment from the
> ptrpool and free the fragment.
>
> We may refactor out the common part between objpool and ptrpool
> if this ptrpool thing turns out to be helpful for other place.

This isn't a patch where you should be introducing stuff you hope to
refactor out and reuse later. Your objpoo/ptrpool stuff is just going
to add bloat and overhead as you are going to have to do pointer
changes to get them in and out of memory and you are having to scan
per-cpu lists. You would be better served using a simple array as your
threads should be stick to a consistent CPU anyway in terms of
testing.

I would suggest keeping this much more simple. Trying to pattern this
after something like the dmapool_test code would be a better way to go
for this. We don't need all this extra objpool overhead getting in the
way of testing the code you should be focused on. Just allocate your
array on one specific CPU and start placing and removing your pages
from there instead of messing with the push/pop semantics.

Lastly something that is a module only tester that always fails to
probe doesn't sound like it really makes sense as a standard kernel
module. I still think it would make more sense to move it to the
selftests tree and just have it build there as a module instead of
trying to force it into the mm tree. The example of dmapool_test makes
sense as it could be run at early boot to run the test and then it
just goes quiet. This module won't load and will always just return
-EAGAIN which doesn't sound like a valid kernel module to me.
Yunsheng Lin Aug. 1, 2024, 12:58 p.m. UTC | #2
On 2024/8/1 2:29, Alexander Duyck wrote:
> On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Basing on the lib/objpool.c, change it to something like a
>> ptrpool, so that we can utilize that to test the correctness
>> and performance of the page_frag.
>>
>> The testing is done by ensuring that the fragment allocated
>> from a frag_frag_cache instance is pushed into a ptrpool
>> instance in a kthread binded to a specified cpu, and a kthread
>> binded to a specified cpu will pop the fragment from the
>> ptrpool and free the fragment.
>>
>> We may refactor out the common part between objpool and ptrpool
>> if this ptrpool thing turns out to be helpful for other place.
> 
> This isn't a patch where you should be introducing stuff you hope to
> refactor out and reuse later. Your objpoo/ptrpool stuff is just going
> to add bloat and overhead as you are going to have to do pointer
> changes to get them in and out of memory and you are having to scan
> per-cpu lists. You would be better served using a simple array as your
> threads should be stick to a consistent CPU anyway in terms of
> testing.
> 
> I would suggest keeping this much more simple. Trying to pattern this
> after something like the dmapool_test code would be a better way to go
> for this. We don't need all this extra objpool overhead getting in the
> way of testing the code you should be focused on. Just allocate your
> array on one specific CPU and start placing and removing your pages
> from there instead of messing with the push/pop semantics.

I am not sure if I understand what you meant here, do you meant something
like dmapool_test_alloc() does as something like below?

static int page_frag_test_alloc(void **p, int blocks)
{
	int i;

	for (i = 0; i < blocks; i++) {
		p[i] = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL);

		if (!p[i])
			goto pool_fail;
	}

	for (i = 0; i < blocks; i++)
		page_frag_free(p[i]);

	....
}

The above was my initial thinking too, I went to the ptrpool thing using
at least two CPUs as the below reason:
1. Test the concurrent calling between allocing and freeing more throughly,
   for example, page->_refcount concurrent handling, cache draining and
   cache reusing code path will be tested more throughly.
2. Test the performance impact of cache bouncing between different CPUs.

I am not sure if there is a more lightweight implementation than ptrpool
to do the above testing more throughly.

> 
> Lastly something that is a module only tester that always fails to
> probe doesn't sound like it really makes sense as a standard kernel

I had the same feeling as you, but when doing testing, it seems
convenient enough to do a 'insmod xxx.ko' for testing without a
'rmmod xxx.ko'

> module. I still think it would make more sense to move it to the
> selftests tree and just have it build there as a module instead of

I failed to find one example of test kernel module that is in the
selftests tree yet. If it does make sense, please provide an example
here, and I am willing to follow the pattern if there is one.

> trying to force it into the mm tree. The example of dmapool_test makes
> sense as it could be run at early boot to run the test and then it

I suppose you meant dmapool is built-in to the kernel and run at early
boot? I am not sure what is the point of built-in for dmapool, as it
only do one-time testing, and built-in for dmapool only waste some
memory when testing is done.

> just goes quiet. This module won't load and will always just return
> -EAGAIN which doesn't sound like a valid kernel module to me.

As above, it seems convenient enough to do a 'insmod xxx.ko' for testing
without a 'rmmod xxx.ko'.
Alexander Duyck Aug. 1, 2024, 2:50 p.m. UTC | #3
On Thu, Aug 1, 2024 at 5:58 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/1 2:29, Alexander Duyck wrote:
> > On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> Basing on the lib/objpool.c, change it to something like a
> >> ptrpool, so that we can utilize that to test the correctness
> >> and performance of the page_frag.
> >>
> >> The testing is done by ensuring that the fragment allocated
> >> from a frag_frag_cache instance is pushed into a ptrpool
> >> instance in a kthread binded to a specified cpu, and a kthread
> >> binded to a specified cpu will pop the fragment from the
> >> ptrpool and free the fragment.
> >>
> >> We may refactor out the common part between objpool and ptrpool
> >> if this ptrpool thing turns out to be helpful for other place.
> >
> > This isn't a patch where you should be introducing stuff you hope to
> > refactor out and reuse later. Your objpoo/ptrpool stuff is just going
> > to add bloat and overhead as you are going to have to do pointer
> > changes to get them in and out of memory and you are having to scan
> > per-cpu lists. You would be better served using a simple array as your
> > threads should be stick to a consistent CPU anyway in terms of
> > testing.
> >
> > I would suggest keeping this much more simple. Trying to pattern this
> > after something like the dmapool_test code would be a better way to go
> > for this. We don't need all this extra objpool overhead getting in the
> > way of testing the code you should be focused on. Just allocate your
> > array on one specific CPU and start placing and removing your pages
> > from there instead of messing with the push/pop semantics.
>
> I am not sure if I understand what you meant here, do you meant something
> like dmapool_test_alloc() does as something like below?
>
> static int page_frag_test_alloc(void **p, int blocks)
> {
>         int i;
>
>         for (i = 0; i < blocks; i++) {
>                 p[i] = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL);
>
>                 if (!p[i])
>                         goto pool_fail;
>         }
>
>         for (i = 0; i < blocks; i++)
>                 page_frag_free(p[i]);
>
>         ....
> }
>
> The above was my initial thinking too, I went to the ptrpool thing using
> at least two CPUs as the below reason:
> 1. Test the concurrent calling between allocing and freeing more throughly,
>    for example, page->_refcount concurrent handling, cache draining and
>    cache reusing code path will be tested more throughly.
> 2. Test the performance impact of cache bouncing between different CPUs.
>
> I am not sure if there is a more lightweight implementation than ptrpool
> to do the above testing more throughly.

You can still do that with a single producer single consumer ring
buffer/array and not have to introduce a ton of extra overhead for
some push/pop approach. There are a number of different
implementations for such things throughout the kernel.

>
> >
> > Lastly something that is a module only tester that always fails to
> > probe doesn't sound like it really makes sense as a standard kernel
>
> I had the same feeling as you, but when doing testing, it seems
> convenient enough to do a 'insmod xxx.ko' for testing without a
> 'rmmod xxx.ko'

It means this isn't a viable module though. If it supports insmod to
trigger your tests you should let it succeed, and then do a rmmod to
remove it afterwards. Otherwise it is a test module and belongs in the
selftest block.

> > module. I still think it would make more sense to move it to the
> > selftests tree and just have it build there as a module instead of
>
> I failed to find one example of test kernel module that is in the
> selftests tree yet. If it does make sense, please provide an example
> here, and I am willing to follow the pattern if there is one.

You must not have been looking very hard. A quick grep for
"module_init" in the selftest folder comes up with
"tools/testing/selftests/bpf/bpf_testmod/" containing an example of a
module built in the selftests folder.

> > trying to force it into the mm tree. The example of dmapool_test makes
> > sense as it could be run at early boot to run the test and then it
>
> I suppose you meant dmapool is built-in to the kernel and run at early
> boot? I am not sure what is the point of built-in for dmapool, as it
> only do one-time testing, and built-in for dmapool only waste some
> memory when testing is done.

There are cases where one might want to test on a system w/o console
access such as an embedded system, or in the case of an environment
where people run without an initrd at all.

> > just goes quiet. This module won't load and will always just return
> > -EAGAIN which doesn't sound like a valid kernel module to me.
>
> As above, it seems convenient enough to do a 'insmod xxx.ko' for testing
> without a 'rmmod xxx.ko'.

It is, but it isn't. The problem is it creates a bunch of ugliness in
the build as you are a tristate that isn't a tristate as you are only
building it if it is set to "m". There isn't anything like that
currently in the mm tree.
Yunsheng Lin Aug. 2, 2024, 10:02 a.m. UTC | #4
On 2024/8/1 22:50, Alexander Duyck wrote:

>>
>> The above was my initial thinking too, I went to the ptrpool thing using
>> at least two CPUs as the below reason:
>> 1. Test the concurrent calling between allocing and freeing more throughly,
>>    for example, page->_refcount concurrent handling, cache draining and
>>    cache reusing code path will be tested more throughly.
>> 2. Test the performance impact of cache bouncing between different CPUs.
>>
>> I am not sure if there is a more lightweight implementation than ptrpool
>> to do the above testing more throughly.
> 
> You can still do that with a single producer single consumer ring
> buffer/array and not have to introduce a ton of extra overhead for
> some push/pop approach. There are a number of different
> implementations for such things throughout the kernel.

if we limit that to single producer single consumer, it seems we can
use ptr_ring to replace ptrpool.

> 
>>
>>>
>>> Lastly something that is a module only tester that always fails to
>>> probe doesn't sound like it really makes sense as a standard kernel
>>
>> I had the same feeling as you, but when doing testing, it seems
>> convenient enough to do a 'insmod xxx.ko' for testing without a
>> 'rmmod xxx.ko'
> 
> It means this isn't a viable module though. If it supports insmod to
> trigger your tests you should let it succeed, and then do a rmmod to
> remove it afterwards. Otherwise it is a test module and belongs in the
> selftest block.
> 
>>> module. I still think it would make more sense to move it to the
>>> selftests tree and just have it build there as a module instead of
>>
>> I failed to find one example of test kernel module that is in the
>> selftests tree yet. If it does make sense, please provide an example
>> here, and I am willing to follow the pattern if there is one.
> 
> You must not have been looking very hard. A quick grep for
> "module_init" in the selftest folder comes up with
> "tools/testing/selftests/bpf/bpf_testmod/" containing an example of a
> module built in the selftests folder.

After close look, it seems it will be treated as third party module when
adding a kernel module in tools/testing/selftests as there seems to be no
config for it in Kconfig file and can only be compiled as a module not as
built-in.

> 
>>> trying to force it into the mm tree. The example of dmapool_test makes
>>> sense as it could be run at early boot to run the test and then it
>>
>> I suppose you meant dmapool is built-in to the kernel and run at early
>> boot? I am not sure what is the point of built-in for dmapool, as it
>> only do one-time testing, and built-in for dmapool only waste some
>> memory when testing is done.
> 
> There are cases where one might want to test on a system w/o console
> access such as an embedded system, or in the case of an environment
> where people run without an initrd at all.

I think moving it to tools/testing/selftests may defeat the above purpose.

> 
>>> just goes quiet. This module won't load and will always just return
>>> -EAGAIN which doesn't sound like a valid kernel module to me.
>>
>> As above, it seems convenient enough to do a 'insmod xxx.ko' for testing
>> without a 'rmmod xxx.ko'.
> 
> It is, but it isn't. The problem is it creates a bunch of ugliness in

Yes, it seems a bit ugly, but it supports the below perf cmd, I really
would like to support the below case as it is very convenient.

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17

> the build as you are a tristate that isn't a tristate as you are only
> building it if it is set to "m". There isn't anything like that
> currently in the mm tree.

After moving page_frag_test to selftest, it is only bulit as module, I guess
it is ok to return -EAGAIN?
Alexander Duyck Aug. 2, 2024, 4:42 p.m. UTC | #5
On Fri, Aug 2, 2024 at 3:02 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/1 22:50, Alexander Duyck wrote:
>
> >>
> >> The above was my initial thinking too, I went to the ptrpool thing using
> >> at least two CPUs as the below reason:
> >> 1. Test the concurrent calling between allocing and freeing more throughly,
> >>    for example, page->_refcount concurrent handling, cache draining and
> >>    cache reusing code path will be tested more throughly.
> >> 2. Test the performance impact of cache bouncing between different CPUs.
> >>
> >> I am not sure if there is a more lightweight implementation than ptrpool
> >> to do the above testing more throughly.
> >
> > You can still do that with a single producer single consumer ring
> > buffer/array and not have to introduce a ton of extra overhead for
> > some push/pop approach. There are a number of different
> > implementations for such things throughout the kernel.
>
> if we limit that to single producer single consumer, it seems we can
> use ptr_ring to replace ptrpool.

Right. That is more or less what I was thinking.

> >
> >>
> >>>
> >>> Lastly something that is a module only tester that always fails to
> >>> probe doesn't sound like it really makes sense as a standard kernel
> >>
> >> I had the same feeling as you, but when doing testing, it seems
> >> convenient enough to do a 'insmod xxx.ko' for testing without a
> >> 'rmmod xxx.ko'
> >
> > It means this isn't a viable module though. If it supports insmod to
> > trigger your tests you should let it succeed, and then do a rmmod to
> > remove it afterwards. Otherwise it is a test module and belongs in the
> > selftest block.
> >
> >>> module. I still think it would make more sense to move it to the
> >>> selftests tree and just have it build there as a module instead of
> >>
> >> I failed to find one example of test kernel module that is in the
> >> selftests tree yet. If it does make sense, please provide an example
> >> here, and I am willing to follow the pattern if there is one.
> >
> > You must not have been looking very hard. A quick grep for
> > "module_init" in the selftest folder comes up with
> > "tools/testing/selftests/bpf/bpf_testmod/" containing an example of a
> > module built in the selftests folder.
>
> After close look, it seems it will be treated as third party module when
> adding a kernel module in tools/testing/selftests as there seems to be no
> config for it in Kconfig file and can only be compiled as a module not as
> built-in.

Right now you can't compile it as built-in anyway and you were
returning EAGAIN. If you are going that route then it makes more sense
to compile it outside of the mm tree since it isn't a valid module in
the first place.

> >
> >>> trying to force it into the mm tree. The example of dmapool_test makes
> >>> sense as it could be run at early boot to run the test and then it
> >>
> >> I suppose you meant dmapool is built-in to the kernel and run at early
> >> boot? I am not sure what is the point of built-in for dmapool, as it
> >> only do one-time testing, and built-in for dmapool only waste some
> >> memory when testing is done.
> >
> > There are cases where one might want to test on a system w/o console
> > access such as an embedded system, or in the case of an environment
> > where people run without an initrd at all.
>
> I think moving it to tools/testing/selftests may defeat the above purpose.

That is why I am suggesting either fix the module so that it can be
compiled in, or move it to selftest. The current module is not a valid
one and doesn't belong here in its current form.

> >
> >>> just goes quiet. This module won't load and will always just return
> >>> -EAGAIN which doesn't sound like a valid kernel module to me.
> >>
> >> As above, it seems convenient enough to do a 'insmod xxx.ko' for testing
> >> without a 'rmmod xxx.ko'.
> >
> > It is, but it isn't. The problem is it creates a bunch of ugliness in
>
> Yes, it seems a bit ugly, but it supports the below perf cmd, I really
> would like to support the below case as it is very convenient.
>
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17

That is fine. If that is the case then it should be in the selftest folder.

> > the build as you are a tristate that isn't a tristate as you are only
> > building it if it is set to "m". There isn't anything like that
> > currently in the mm tree.
>
> After moving page_frag_test to selftest, it is only bulit as module, I guess
> it is ok to return -EAGAIN?

Yes, I would be fine with it in that case.
diff mbox series

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index b72e7d040f78..305f02df7d67 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1140,6 +1140,14 @@  config DMAPOOL_TEST
 	  provide a consistent way to measure how changes to the
 	  dma_pool_alloc/free routines affect performance.
 
+config PAGE_FRAG_TEST
+	tristate "Test module for page_frag"
+	default n
+	depends on m
+	help
+	  Provides a test module that is used to test the correctness and
+	  performance of page_frag's implementation.
+
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..59c587341e54 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -141,3 +141,4 @@  obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
+obj-$(CONFIG_PAGE_FRAG_TEST) += page_frag_test.o
diff --git a/mm/page_frag_test.c b/mm/page_frag_test.c
new file mode 100644
index 000000000000..cf2691f60b67
--- /dev/null
+++ b/mm/page_frag_test.c
@@ -0,0 +1,393 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for page_frag cache
+ *
+ * Copyright: linyunsheng@huawei.com
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/irqflags.h>
+#include <linux/cpumask.h>
+#include <linux/log2.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+
+#define OBJPOOL_NR_OBJECT_MAX	BIT(24)
+
+struct objpool_slot {
+	u32 head;
+	u32 tail;
+	u32 last;
+	u32 mask;
+	void *entries[];
+} __packed;
+
+struct objpool_head {
+	int nr_cpus;
+	int capacity;
+	struct objpool_slot **cpu_slots;
+};
+
+/* initialize percpu objpool_slot */
+static void objpool_init_percpu_slot(struct objpool_head *pool,
+				     struct objpool_slot *slot)
+{
+	/* initialize elements of percpu objpool_slot */
+	slot->mask = pool->capacity - 1;
+}
+
+/* allocate and initialize percpu slots */
+static int objpool_init_percpu_slots(struct objpool_head *pool,
+				     int nr_objs, gfp_t gfp)
+{
+	int i;
+
+	for (i = 0; i < pool->nr_cpus; i++) {
+		struct objpool_slot *slot;
+		int size;
+
+		/* skip the cpu node which could never be present */
+		if (!cpu_possible(i))
+			continue;
+
+		size = struct_size(slot, entries, pool->capacity);
+
+		/*
+		 * here we allocate percpu-slot & objs together in a single
+		 * allocation to make it more compact, taking advantage of
+		 * warm caches and TLB hits. in default vmalloc is used to
+		 * reduce the pressure of kernel slab system. as we know,
+		 * minimal size of vmalloc is one page since vmalloc would
+		 * always align the requested size to page size
+		 */
+		if (gfp & GFP_ATOMIC)
+			slot = kmalloc_node(size, gfp, cpu_to_node(i));
+		else
+			slot = __vmalloc_node(size, sizeof(void *), gfp,
+					      cpu_to_node(i),
+					      __builtin_return_address(0));
+		if (!slot)
+			return -ENOMEM;
+
+		memset(slot, 0, size);
+		pool->cpu_slots[i] = slot;
+
+		objpool_init_percpu_slot(pool, slot);
+	}
+
+	return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static void objpool_fini_percpu_slots(struct objpool_head *pool)
+{
+	int i;
+
+	if (!pool->cpu_slots)
+		return;
+
+	for (i = 0; i < pool->nr_cpus; i++)
+		kvfree(pool->cpu_slots[i]);
+	kfree(pool->cpu_slots);
+}
+
+/* initialize object pool and pre-allocate objects */
+static int objpool_init(struct objpool_head *pool, int nr_objs, gfp_t gfp)
+{
+	int rc, capacity, slot_size;
+
+	/* check input parameters */
+	if (nr_objs <= 0 || nr_objs > OBJPOOL_NR_OBJECT_MAX)
+		return -EINVAL;
+
+	/* calculate capacity of percpu objpool_slot */
+	capacity = roundup_pow_of_two(nr_objs);
+	if (!capacity)
+		return -EINVAL;
+
+	gfp = gfp & ~__GFP_ZERO;
+
+	/* initialize objpool pool */
+	memset(pool, 0, sizeof(struct objpool_head));
+	pool->nr_cpus = nr_cpu_ids;
+	pool->capacity = capacity;
+	slot_size = pool->nr_cpus * sizeof(struct objpool_slot *);
+	pool->cpu_slots = kzalloc(slot_size, gfp);
+	if (!pool->cpu_slots)
+		return -ENOMEM;
+
+	/* initialize per-cpu slots */
+	rc = objpool_init_percpu_slots(pool, nr_objs, gfp);
+	if (rc)
+		objpool_fini_percpu_slots(pool);
+
+	return rc;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = pool->cpu_slots[cpu];
+	u32 head, tail;
+
+	/* loading tail and head as a local snapshot, tail first */
+	tail = READ_ONCE(slot->tail);
+
+	do {
+		head = READ_ONCE(slot->head);
+		/* fault caught: something must be wrong */
+		if (unlikely(tail - head >= pool->capacity))
+			return -ENOSPC;
+	} while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1));
+
+	/* now the tail position is reserved for the given obj */
+	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
+	/* update sequence to make this obj available for pop() */
+	smp_store_release(&slot->last, tail + 1);
+
+	return 0;
+}
+
+/* reclaim an object to object pool */
+static int objpool_push(void *obj, struct objpool_head *pool)
+{
+	unsigned long flags;
+	int rc;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
+	raw_local_irq_restore(flags);
+
+	return rc;
+}
+
+/* try to retrieve object from slot */
+static void *objpool_try_get_slot(struct objpool_head *pool, int cpu)
+{
+	struct objpool_slot *slot = pool->cpu_slots[cpu];
+	/* load head snapshot, other cpus may change it */
+	u32 head = smp_load_acquire(&slot->head);
+
+	while (head != READ_ONCE(slot->last)) {
+		void *obj;
+
+		/*
+		 * data visibility of 'last' and 'head' could be out of
+		 * order since memory updating of 'last' and 'head' are
+		 * performed in push() and pop() independently
+		 *
+		 * before any retrieving attempts, pop() must guarantee
+		 * 'last' is behind 'head', that is to say, there must
+		 * be available objects in slot, which could be ensured
+		 * by condition 'last != head && last - head <= nr_objs'
+		 * that is equivalent to 'last - head - 1 < nr_objs' as
+		 * 'last' and 'head' are both unsigned int32
+		 */
+		if (READ_ONCE(slot->last) - head - 1 >= pool->capacity) {
+			head = READ_ONCE(slot->head);
+			continue;
+		}
+
+		/* obj must be retrieved before moving forward head */
+		obj = READ_ONCE(slot->entries[head & slot->mask]);
+
+		/* move head forward to mark it's consumption */
+		if (try_cmpxchg_release(&slot->head, &head, head + 1))
+			return obj;
+	}
+
+	return NULL;
+}
+
+/* allocate an object from object pool */
+static void *objpool_pop(struct objpool_head *pool)
+{
+	void *obj = NULL;
+	unsigned long flags;
+	int i, cpu;
+
+	/* disable local irq to avoid preemption & interruption */
+	raw_local_irq_save(flags);
+
+	cpu = raw_smp_processor_id();
+	for (i = 0; i < num_possible_cpus(); i++) {
+		obj = objpool_try_get_slot(pool, cpu);
+		if (obj)
+			break;
+		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+	}
+	raw_local_irq_restore(flags);
+
+	return obj;
+}
+
+/* release whole objpool forcely */
+static void objpool_free(struct objpool_head *pool)
+{
+	if (!pool->cpu_slots)
+		return;
+
+	/* release percpu slots */
+	objpool_fini_percpu_slots(pool);
+}
+
+static struct objpool_head ptr_pool;
+static int nr_objs = 512;
+static atomic_t nthreads;
+static struct completion wait;
+static struct page_frag_cache test_frag;
+
+static int nr_test = 5120000;
+module_param(nr_test, int, 0);
+MODULE_PARM_DESC(nr_test, "number of iterations to test");
+
+static bool test_align;
+module_param(test_align, bool, 0);
+MODULE_PARM_DESC(test_align, "use align API for testing");
+
+static int test_alloc_len = 2048;
+module_param(test_alloc_len, int, 0);
+MODULE_PARM_DESC(test_alloc_len, "alloc len for testing");
+
+static int test_push_cpu;
+module_param(test_push_cpu, int, 0);
+MODULE_PARM_DESC(test_push_cpu, "test cpu for pushing fragment");
+
+static int test_pop_cpu;
+module_param(test_pop_cpu, int, 0);
+MODULE_PARM_DESC(test_pop_cpu, "test cpu for popping fragment");
+
+static int page_frag_pop_thread(void *arg)
+{
+	struct objpool_head *pool = arg;
+	int nr = nr_test;
+
+	pr_info("page_frag pop test thread begins on cpu %d\n",
+		smp_processor_id());
+
+	while (nr > 0) {
+		void *obj = objpool_pop(pool);
+
+		if (obj) {
+			nr--;
+			page_frag_free(obj);
+		} else {
+			cond_resched();
+		}
+	}
+
+	if (atomic_dec_and_test(&nthreads))
+		complete(&wait);
+
+	pr_info("page_frag pop test thread exits on cpu %d\n",
+		smp_processor_id());
+
+	return 0;
+}
+
+static int page_frag_push_thread(void *arg)
+{
+	struct objpool_head *pool = arg;
+	int nr = nr_test;
+
+	pr_info("page_frag push test thread begins on cpu %d\n",
+		smp_processor_id());
+
+	while (nr > 0) {
+		void *va;
+		int ret;
+
+		if (test_align) {
+			va = page_frag_alloc_align(&test_frag, test_alloc_len,
+						   GFP_KERNEL, SMP_CACHE_BYTES);
+
+			WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1),
+				  "unaligned va returned\n");
+		} else {
+			va = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL);
+		}
+
+		if (!va)
+			continue;
+
+		ret = objpool_push(va, pool);
+		if (ret) {
+			page_frag_free(va);
+			cond_resched();
+		} else {
+			nr--;
+		}
+	}
+
+	pr_info("page_frag push test thread exits on cpu %d\n",
+		smp_processor_id());
+
+	if (atomic_dec_and_test(&nthreads))
+		complete(&wait);
+
+	return 0;
+}
+
+static int __init page_frag_test_init(void)
+{
+	struct task_struct *tsk_push, *tsk_pop;
+	ktime_t start;
+	u64 duration;
+	int ret;
+
+	test_frag.va = NULL;
+	atomic_set(&nthreads, 2);
+	init_completion(&wait);
+
+	if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0)
+		return -EINVAL;
+
+	ret = objpool_init(&ptr_pool, nr_objs, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_pool,
+					 test_push_cpu, "page_frag_push");
+	if (IS_ERR(tsk_push))
+		return PTR_ERR(tsk_push);
+
+	tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_pool,
+					test_pop_cpu, "page_frag_pop");
+	if (IS_ERR(tsk_pop)) {
+		kthread_stop(tsk_push);
+		return PTR_ERR(tsk_pop);
+	}
+
+	start = ktime_get();
+	wake_up_process(tsk_push);
+	wake_up_process(tsk_pop);
+
+	pr_info("waiting for test to complete\n");
+	wait_for_completion(&wait);
+
+	duration = (u64)ktime_us_delta(ktime_get(), start);
+	pr_info("%d of iterations for %s testing took: %lluus\n", nr_test,
+		test_align ? "aligned" : "non-aligned", duration);
+
+	objpool_free(&ptr_pool);
+	page_frag_cache_drain(&test_frag);
+
+	return -EAGAIN;
+}
+
+static void __exit page_frag_test_exit(void)
+{
+}
+
+module_init(page_frag_test_init);
+module_exit(page_frag_test_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Yunsheng Lin <linyunsheng@huawei.com>");
+MODULE_DESCRIPTION("Test module for page_frag");