Message ID: 20231021144317.3400916-1-chengming.zhou@linux.dev
Series:     slub: Delay freezing of CPU partial slabs
On Sat, Oct 21, 2023 at 11:43 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Changes in RFC v2:
> - Reuse PG_workingset bit to keep track of whether slub is on the
>   per-node partial list, as suggested by Matthew Wilcox.
> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>   is caused by a leak of partial slabs in get_partial_node().
> - Add a patch to simplify acquire_slab().
> - Reorder patches a little.
> - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/

I've picked [1] and tested this patch series and it passed a simple MM
& slab test in 30 different SLUB configurations [2].

Also there's code coverage information [3] if you're interested :P

For the series,
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Will review when I have free time ;)
Thanks!

[1] https://git.kerneltesting.org/slab-experimental/log/
[2] https://jenkins.kerneltesting.org/job/slab-experimental/
[3] https://coverage.kerneltesting.org/slab-experimental-6283c415/mm/index.html

> Chengming Zhou (6):
>   slub: Keep track of whether slub is on the per-node partial list
>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>     partial list
>   slub: Don't freeze slabs for cpu partial
>   slub: Simplify acquire_slab()
>   slub: Introduce get_cpu_partial()
>   slub: Optimize deactivate_slab()
>
>  include/linux/page-flags.h |   2 +
>  mm/slab.h                  |  19 +++
>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>  3 files changed, 150 insertions(+), 116 deletions(-)
>
> --
> 2.20.1
>
On 10/21/23 16:43, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>

Hi!

> Changes in RFC v2:
> - Reuse PG_workingset bit to keep track of whether slub is on the
>   per-node partial list, as suggested by Matthew Wilcox.
> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>   is caused by a leak of partial slabs in get_partial_node().
> - Add a patch to simplify acquire_slab().
> - Reorder patches a little.
> - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/
>
> 1. Problem
> ==========
> Currently we have to freeze the slab when taking it from the node partial
> list, and unfreeze the slab when putting it back, because we need to rely
> on the node list_lock to synchronize the "frozen" bit changes.
>
> This implementation has some drawbacks:
>
> - Alloc path: twice cmpxchg_double.
>   It has to get some partial slabs from the node when the allocator has
>   used up the CPU partial slabs. So it freezes the slab (one
>   cmpxchg_double) with the node list_lock held, and puts those frozen
>   slabs on its CPU partial list. Later, ___slab_alloc() will run the
>   cmpxchg_double try-loop again if that slab is picked for use.
>
> - Alloc path: amplified contention on node list_lock.
>   Since we have to synchronize the "frozen" bit changes under the node
>   list_lock, the contention on the slab (struct page) can be transferred
>   to the node list_lock. On a machine with many CPUs in one node, the
>   contention on list_lock will be amplified by all CPUs' alloc paths.
>
>   The current code has to work around this problem by avoiding the
>   cmpxchg_double try-loop, which just breaks and returns when page
>   contention is encountered and the first cmpxchg_double fails. But
>   this workaround has its own problem.

I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
consider rest of partial list if acquire_slab() fails"")

> - Free path: redundant unfreeze.
>   __slab_free() will freeze and cache some slabs on its partial list, and
>   flush them to the node partial list when the limit is exceeded, which
>   has to unfreeze those slabs again under the node list_lock. Actually we
>   don't need to freeze slabs on the CPU partial list, in which case we
>   can save the unfreeze cmpxchg_double operations in the flush path.
>
> 2. Solution
> ===========
> We solve these problems by leaving slabs unfrozen when they move out of
> the node partial list and onto the CPU partial list, so the "frozen" bit
> is 0.
>
> These partial slabs won't be manipulated concurrently by the alloc path;
> the only racer is the free path, which may manipulate the slab's list
> when !inuse. So we need another way to synchronize: we reuse
> PG_workingset to keep track of whether the slab is on the node partial
> list or not; only in that case may we manipulate the slab's list.
>
> The slab's freezing is delayed until it's picked for active use by the
> CPU, at which point it becomes full at the same time; in that case we
> still need to rely on the "frozen" bit to avoid manipulating its list.
> So the slab is frozen only on activation and unfrozen only on
> deactivation.

Interesting solution! I wonder if we could go a bit further and remove
acquire_slab() completely. Because AFAICS even after your changes,
acquire_slab() is still attempted, including freezing the slab, which means
still doing a cmpxchg_double under the list_lock, and now also handling the
special case when it fails, though we have at least filled the percpu
partial lists.

What if we only filled the partial list without freezing, and then froze the
first slab outside of the list_lock?

Or more precisely, instead of returning the acquired "object" we would
return the first slab removed from the partial list. I think it would
simplify the code a bit, and further reduce list_lock holding times.

I'll also point out a few more details, but this is not a full detailed
review, as the suggestion above, and another for 4/5, could mean a rather
significant change for v3.

Thanks!

> 3. Testing
> ==========
> We just did some simple testing on a server with 128 CPUs (2 nodes) to
> compare performance for now.
>
> - perf bench sched messaging -g 5 -t -l 100000
>         baseline        RFC
>         7.042s          6.966s
>         7.022s          7.045s
>         7.054s          6.985s
>
> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>         baseline        RFC
>         2.42s           2.15s
>         2.45s           2.16s
>         2.44s           2.17s
>
> The numbers above show about a 10% improvement on the stress-ng rawpkt
> testcase, though not much improvement on the perf sched bench testcase.
>
> Thanks for any comment and code review!
>
> Chengming Zhou (6):
>   slub: Keep track of whether slub is on the per-node partial list
>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>     partial list
>   slub: Don't freeze slabs for cpu partial
>   slub: Simplify acquire_slab()
>   slub: Introduce get_cpu_partial()
>   slub: Optimize deactivate_slab()
>
>  include/linux/page-flags.h |   2 +
>  mm/slab.h                  |  19 +++
>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>  3 files changed, 150 insertions(+), 116 deletions(-)
>
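[Editor's note: as a rough illustration of the suggestion above, the loop
could look something like the sketch below. Helper names (remove_partial(),
put_cpu_partial(), pfmemalloc_match()) follow existing mm/slub.c usage, but
the overall structure and limits are illustrative assumptions, not the
posted patches.]

	/*
	 * Sketch: take slabs off the node partial list without freezing
	 * them. The first removed slab is returned to the caller, which
	 * freezes it outside of list_lock; the rest go to the percpu
	 * partial list. Only list manipulation happens under list_lock --
	 * no cmpxchg_double.
	 */
	static struct slab *get_partial_node(struct kmem_cache *s,
					     struct kmem_cache_node *n,
					     struct partial_context *pc)
	{
		struct slab *slab, *slab2, *partial = NULL;
		unsigned long flags;
		unsigned int partial_slabs = 0;

		if (!n || !n->nr_partial)
			return NULL;

		spin_lock_irqsave(&n->list_lock, flags);
		list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
			if (!pfmemalloc_match(slab, pc->flags))
				continue;

			/* Unlink and clear the node-partial tracking flag. */
			remove_partial(n, slab);

			if (!partial) {
				/* Returned unfrozen; caller freezes it later. */
				partial = slab;
			} else {
				put_cpu_partial(s, slab, 0);
				if (++partial_slabs > s->cpu_partial_slabs / 2)
					break;
			}
		}
		spin_unlock_irqrestore(&n->list_lock, flags);

		return partial;
	}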
On Mon, 23 Oct 2023, Vlastimil Babka wrote:

>> The slab's freezing is delayed until it's picked for active use by the
>> CPU, at which point it becomes full at the same time; in that case we
>> still need to rely on the "frozen" bit to avoid manipulating its list.
>> So the slab is frozen only on activation and unfrozen only on
>> deactivation.
>
> Interesting solution! I wonder if we could go a bit further and remove
> acquire_slab() completely. Because AFAICS even after your changes,
> acquire_slab() is still attempted, including freezing the slab, which means
> still doing a cmpxchg_double under the list_lock, and now also handling the
> special case when it fails, though we have at least filled the percpu
> partial lists.
>
> What if we only filled the partial list without freezing, and then froze the
> first slab outside of the list_lock?
>
> Or more precisely, instead of returning the acquired "object" we would
> return the first slab removed from the partial list. I think it would
> simplify the code a bit, and further reduce list_lock holding times.
>
> I'll also point out a few more details, but this is not a full detailed
> review, as the suggestion above, and another for 4/5, could mean a rather
> significant change for v3.

This is not that easy. The frozen bit indicates that list management does
not have to be done for a slab if it's processed in free. If you take a
slab off the list without setting that bit, then something else needs to
provide the information that "frozen" provided.

If the frozen bit changes can be handled in a different way than with
cmpxchg, then that is a good optimization.

For much of the frozen handling we must be holding the node list lock
anyway in order to add/remove from the list. So we already have a lock
that could be used to protect flag operations.
On 10/23/23 19:00, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>> The slab's freezing is delayed until it's picked for active use by the
>>> CPU, at which point it becomes full at the same time; in that case we
>>> still need to rely on the "frozen" bit to avoid manipulating its list.
>>> So the slab is frozen only on activation and unfrozen only on
>>> deactivation.
>>
>> Interesting solution! I wonder if we could go a bit further and remove
>> acquire_slab() completely. Because AFAICS even after your changes,
>> acquire_slab() is still attempted, including freezing the slab, which means
>> still doing a cmpxchg_double under the list_lock, and now also handling the
>> special case when it fails, though we have at least filled the percpu
>> partial lists.
>>
>> What if we only filled the partial list without freezing, and then froze the
>> first slab outside of the list_lock?
>>
>> Or more precisely, instead of returning the acquired "object" we would
>> return the first slab removed from the partial list. I think it would
>> simplify the code a bit, and further reduce list_lock holding times.
>>
>> I'll also point out a few more details, but this is not a full detailed
>> review, as the suggestion above, and another for 4/5, could mean a rather
>> significant change for v3.
>
> This is not that easy. The frozen bit indicates that list management does
> not have to be done for a slab if it's processed in free. If you take a
> slab off the list without setting that bit, then something else needs to
> provide the information that "frozen" provided.

Yes, that's the new slab_node_partial flag in patch 1, protected by
list_lock.

> If the frozen bit changes can be handled in a different way than with
> cmpxchg, then that is a good optimization.

Frozen bit stays the same, but some scenarios can now avoid it.

> For much of the frozen handling we must be holding the node list lock
> anyway in order to add/remove from the list. So we already have a lock
> that could be used to protect flag operations.

I can see the following differences between the traditional frozen bit and
the new flag:

frozen bit advantage:
- __slab_free() on an already-frozen slab can ignore list operations and
  list_lock completely

frozen bit disadvantage:
- acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
  9b1ea29bc0d7)

slab_node_partial flag advantage:
- we can take slabs off the node partial list without cmpxchg_double()
- probably fewer cmpxchg_double() operations overall

slab_node_partial flag disadvantage:
- a __slab_free() that encounters a slab that's not frozen (and whose
  slab_node_partial flag is not set) might have to do more work, including
  taking the list_lock only to find out that the slab_node_partial flag is
  false (but AFAICS that happens only when the slab becomes fully free by
  the free operation, thus relatively rarely).

Put together, I think we might indeed get the best of both if the frozen
flag is kept for cpu slabs, and we rely on the slab_node_partial flag for
cpu partial slabs, as the series does.
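[Editor's note: to make the __slab_free() cases above concrete, here is a
condensed sketch of the decision logic, assuming the series'
slab_test_node_partial() helper from patch 1. The cmpxchg retry loop,
variable declarations, and the empty-slab discard path are elided, so this
is an illustrative fragment, not the posted code.]

	/* Inside __slab_free(), n == NULL initially. */
	was_frozen = new.frozen;
	new.inuse -= cnt;
	if ((!new.inuse || !prior) && !was_frozen) {
		/* List manipulation may be needed: take list_lock, sample flag. */
		n = get_node(s, slab_nid(slab));
		spin_lock_irqsave(&n->list_lock, flags);
		on_node_partial = slab_test_node_partial(slab);
	}

	/* ... cmpxchg_double of freelist/counters happens here ... */

	if (!n) {
		if (was_frozen)
			stat(s, FREE_FROZEN);        /* cpu slab: its owner handles it */
		else if (kmem_cache_has_cpu_partial(s) && !prior)
			put_cpu_partial(s, slab, 1); /* was full; no freeze needed */
		return;
	}

	/*
	 * Unfrozen and not on the node partial list: it must sit on some
	 * CPU's partial list, so leave its list alone.
	 */
	if (prior && !on_node_partial) {
		spin_unlock_irqrestore(&n->list_lock, flags);
		return;
	}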
On Mon, 23 Oct 2023, Vlastimil Babka wrote:

>> For much of the frozen handling we must be holding the node list lock
>> anyway in order to add/remove from the list. So we already have a lock
>> that could be used to protect flag operations.
>
> I can see the following differences between the traditional frozen bit and
> the new flag:
>
> frozen bit advantage:
> - __slab_free() on an already-frozen slab can ignore list operations and
>   list_lock completely
>
> frozen bit disadvantage:
> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
>   9b1ea29bc0d7)

Ok, so a slab is frozen if either of those conditions is met. That gets a
bit complicated to test for. Can we just get away with the
slab_node_partial flag?

The advantage of the frozen state is that it can be changed with a cmpxchg
together with some other values (list pointer, counter) that need updating
at free and allocation.

But frozen updates are rarer, so maybe it's worth it to completely drop the
frozen bit. If both need to be updated then we would have two atomic ops:
one is the cmpxchg, and the other the operation on the page flag.
On 2023/10/22 22:52, Hyeonggon Yoo wrote:
> On Sat, Oct 21, 2023 at 11:43 PM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Changes in RFC v2:
>> - Reuse PG_workingset bit to keep track of whether slub is on the
>>   per-node partial list, as suggested by Matthew Wilcox.
>> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>>   is caused by a leak of partial slabs in get_partial_node().
>> - Add a patch to simplify acquire_slab().
>> - Reorder patches a little.
>> - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/
>
> I've picked [1] and tested this patch series and it passed a simple MM
> & slab test in 30 different SLUB configurations [2].
>
> Also there's code coverage information [3] if you're interested :P
>
> For the series,
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thank you! These are very helpful to me. I will also do more workload
stress testing with more configurations.

> Will review when I have free time ;)

Ok, thanks for your time.

> Thanks!
>
> [1] https://git.kerneltesting.org/slab-experimental/log/
> [2] https://jenkins.kerneltesting.org/job/slab-experimental/
> [3] https://coverage.kerneltesting.org/slab-experimental-6283c415/mm/index.html
>
>> Chengming Zhou (6):
>>   slub: Keep track of whether slub is on the per-node partial list
>>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>>     partial list
>>   slub: Don't freeze slabs for cpu partial
>>   slub: Simplify acquire_slab()
>>   slub: Introduce get_cpu_partial()
>>   slub: Optimize deactivate_slab()
>>
>>  include/linux/page-flags.h |   2 +
>>  mm/slab.h                  |  19 +++
>>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>>  3 files changed, 150 insertions(+), 116 deletions(-)
>>
>> --
>> 2.20.1
>>
On 2023/10/23 23:46, Vlastimil Babka wrote:
> On 10/21/23 16:43, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Hi!
>
>> Changes in RFC v2:
>> - Reuse PG_workingset bit to keep track of whether slub is on the
>>   per-node partial list, as suggested by Matthew Wilcox.
>> - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
>>   is caused by a leak of partial slabs in get_partial_node().
>> - Add a patch to simplify acquire_slab().
>> - Reorder patches a little.
>> - v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/
>>
>> 1. Problem
>> ==========
>> Currently we have to freeze the slab when taking it from the node partial
>> list, and unfreeze the slab when putting it back, because we need to rely
>> on the node list_lock to synchronize the "frozen" bit changes.
>>
>> This implementation has some drawbacks:
>>
>> - Alloc path: twice cmpxchg_double.
>>   It has to get some partial slabs from the node when the allocator has
>>   used up the CPU partial slabs. So it freezes the slab (one
>>   cmpxchg_double) with the node list_lock held, and puts those frozen
>>   slabs on its CPU partial list. Later, ___slab_alloc() will run the
>>   cmpxchg_double try-loop again if that slab is picked for use.
>>
>> - Alloc path: amplified contention on node list_lock.
>>   Since we have to synchronize the "frozen" bit changes under the node
>>   list_lock, the contention on the slab (struct page) can be transferred
>>   to the node list_lock. On a machine with many CPUs in one node, the
>>   contention on list_lock will be amplified by all CPUs' alloc paths.
>>
>>   The current code has to work around this problem by avoiding the
>>   cmpxchg_double try-loop, which just breaks and returns when page
>>   contention is encountered and the first cmpxchg_double fails. But
>>   this workaround has its own problem.
>
> I'd note here: For more context, see 9b1ea29bc0d7 ("Revert "mm, slub:
> consider rest of partial list if acquire_slab() fails"")

Good, will add it.

>> - Free path: redundant unfreeze.
>>   __slab_free() will freeze and cache some slabs on its partial list, and
>>   flush them to the node partial list when the limit is exceeded, which
>>   has to unfreeze those slabs again under the node list_lock. Actually we
>>   don't need to freeze slabs on the CPU partial list, in which case we
>>   can save the unfreeze cmpxchg_double operations in the flush path.
>>
>> 2. Solution
>> ===========
>> We solve these problems by leaving slabs unfrozen when they move out of
>> the node partial list and onto the CPU partial list, so the "frozen" bit
>> is 0.
>>
>> These partial slabs won't be manipulated concurrently by the alloc path;
>> the only racer is the free path, which may manipulate the slab's list
>> when !inuse. So we need another way to synchronize: we reuse
>> PG_workingset to keep track of whether the slab is on the node partial
>> list or not; only in that case may we manipulate the slab's list.
>>
>> The slab's freezing is delayed until it's picked for active use by the
>> CPU, at which point it becomes full at the same time; in that case we
>> still need to rely on the "frozen" bit to avoid manipulating its list.
>> So the slab is frozen only on activation and unfrozen only on
>> deactivation.
>
> Interesting solution! I wonder if we could go a bit further and remove
> acquire_slab() completely. Because AFAICS even after your changes,
> acquire_slab() is still attempted, including freezing the slab, which means
> still doing a cmpxchg_double under the list_lock, and now also handling the
> special case when it fails, though we have at least filled the percpu
> partial lists.
>
> What if we only filled the partial list without freezing, and then froze the
> first slab outside of the list_lock?

Good idea. We can return one slab and put the other slabs on the CPU
partial list, so we can remove acquire_slab() completely and don't need
to handle the failure case. The code will be cleaner, too.

> Or more precisely, instead of returning the acquired "object" we would
> return the first slab removed from the partial list. I think it would
> simplify the code a bit, and further reduce list_lock holding times.

Ok, I will do this in the next version. But I find we have to return the
object in the "IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" case,
in which we need to allocate a single object under the node list_lock.

Maybe we can use "struct partial_context" to return the object in this
case?

struct partial_context {
-	struct slab **slab;
	gfp_t flags;
	unsigned int orig_size;
+	void *object;
};

Then we can change all get_partial interfaces to return a slab. Do you
agree with this approach?

> I'll also point out a few more details, but this is not a full detailed
> review, as the suggestion above, and another for 4/5, could mean a rather
> significant change for v3.

Thank you!

> Thanks!
>
>> 3. Testing
>> ==========
>> We just did some simple testing on a server with 128 CPUs (2 nodes) to
>> compare performance for now.
>>
>> - perf bench sched messaging -g 5 -t -l 100000
>>         baseline        RFC
>>         7.042s          6.966s
>>         7.022s          7.045s
>>         7.054s          6.985s
>>
>> - stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>         baseline        RFC
>>         2.42s           2.15s
>>         2.45s           2.16s
>>         2.44s           2.17s
>>
>> The numbers above show about a 10% improvement on the stress-ng rawpkt
>> testcase, though not much improvement on the perf sched bench testcase.
>>
>> Thanks for any comment and code review!
>>
>> Chengming Zhou (6):
>>   slub: Keep track of whether slub is on the per-node partial list
>>   slub: Prepare __slab_free() for unfrozen partial slab out of node
>>     partial list
>>   slub: Don't freeze slabs for cpu partial
>>   slub: Simplify acquire_slab()
>>   slub: Introduce get_cpu_partial()
>>   slub: Optimize deactivate_slab()
>>
>>  include/linux/page-flags.h |   2 +
>>  mm/slab.h                  |  19 +++
>>  mm/slub.c                  | 245 +++++++++++++++++++------------------
>>  3 files changed, 150 insertions(+), 116 deletions(-)
>>
On 10/23/23 23:05, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>> For much of the frozen handling we must be holding the node list lock
>>> anyway in order to add/remove from the list. So we already have a lock
>>> that could be used to protect flag operations.
>>
>> I can see the following differences between the traditional frozen bit and
>> the new flag:
>>
>> frozen bit advantage:
>> - __slab_free() on an already-frozen slab can ignore list operations and
>>   list_lock completely
>>
>> frozen bit disadvantage:
>> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
>>   9b1ea29bc0d7)
>
> Ok, so a slab is frozen if either of those conditions is met. That gets a
> bit complicated to test for. Can we just get away with the
> slab_node_partial flag?

Might be worth trying, but I'd try that only as a next, separate step. I
think freezing the slab that becomes a cpu slab (not a cpu partial slab)
still has benefits and no extra cost, as that's when we're doing the
cmpxchg_double anyway. And the complicated tests are confined to
__slab_free() and it's not *that* bad IMHO: one condition checks for
was_frozen, another for slab_test_node_partial().

> The advantage of the frozen state is that it can be changed with a cmpxchg
> together with some other values (list pointer, counter) that need updating
> at free and allocation.

Exactly, but for taking a slab off the node partial list we don't need to
deal with those, so that's where it makes sense to delay the frozen bit
handling.

> But frozen updates are rarer, so maybe it's worth it to completely drop the
> frozen bit. If both need to be updated then we would have two atomic ops:
> one is the cmpxchg, and the other the operation on the page flag.

The flag update doesn't even have to be atomic as it's only done under
list_lock.
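[Editor's note: for reference, a minimal sketch of what such
list_lock-protected helpers could look like, assuming the shape of patch 1;
the exact names and the non-atomic __set_bit/__clear_bit choice are
assumptions based on this discussion, not the posted patch.]

	/*
	 * Sketch: track "is this slab on the node partial list?" by reusing
	 * PG_workingset in the slab's folio flags. Writers hold n->list_lock,
	 * so plain (non-atomic) bitops suffice for the updates.
	 */
	static inline bool slab_test_node_partial(const struct slab *slab)
	{
		return folio_test_workingset((struct folio *)slab_folio(slab));
	}

	static inline void slab_set_node_partial(struct slab *slab)
	{
		/* Caller holds n->list_lock. */
		__set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
	}

	static inline void slab_clear_node_partial(struct slab *slab)
	{
		/* Caller holds n->list_lock. */
		__clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
	}

add_partial()/remove_partial() would then set/clear the flag right next to
the list_add()/list_del(), keeping the flag exactly in sync with list
membership.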
On 10/24/23 04:20, Chengming Zhou wrote:
> On 2023/10/23 23:46, Vlastimil Babka wrote:
>> Or more precisely, instead of returning the acquired "object" we would
>> return the first slab removed from the partial list. I think it would
>> simplify the code a bit, and further reduce list_lock holding times.
>
> Ok, I will do this in the next version. But I find we have to return the
> object in the "IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)" case,
> in which we need to allocate a single object under the node list_lock.

Ah, right.

> Maybe we can use "struct partial_context" to return the object in this
> case?
>
> struct partial_context {
> -	struct slab **slab;
> 	gfp_t flags;
> 	unsigned int orig_size;
> +	void *object;
> };
>
> Then we can change all get_partial interfaces to return a slab. Do you
> agree with this approach?

Yeah, good idea!

Thanks!
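[Editor's note: put together, the agreed direction could look roughly like
the fragment below. This is an assumed shape for v3, not posted code;
freeze_slab() is a hypothetical helper doing the out-of-lock freeze, and
gfpflags/orig_size are the usual ___slab_alloc() parameters.]

	/*
	 * get_partial() now returns the first slab taken off the node
	 * partial list; for the SLUB_TINY/debug case the single object
	 * allocated under list_lock comes back via pc.object.
	 */
	struct partial_context pc = { .flags = gfpflags, .orig_size = orig_size };
	struct slab *slab;
	void *freelist;

	slab = get_partial(s, node, &pc);
	if (slab) {
		if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s))
			freelist = pc.object;		 /* allocated under n->list_lock */
		else
			freelist = freeze_slab(s, slab); /* one cmpxchg, outside list_lock */
	}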
On 2023/10/24 05:05, Christoph Lameter (Ampere) wrote:
> On Mon, 23 Oct 2023, Vlastimil Babka wrote:
>
>>> For much of the frozen handling we must be holding the node list lock
>>> anyway in order to add/remove from the list. So we already have a lock
>>> that could be used to protect flag operations.
>>
>> I can see the following differences between the traditional frozen bit and
>> the new flag:
>>
>> frozen bit advantage:
>> - __slab_free() on an already-frozen slab can ignore list operations and
>>   list_lock completely
>>
>> frozen bit disadvantage:
>> - acquire_slab() trying to do cmpxchg_double() under list_lock (see commit
>>   9b1ea29bc0d7)
>
> Ok, so a slab is frozen if either of those conditions is met. That gets a
> bit complicated to test for. Can we just get away with the
> slab_node_partial flag?
>
> The advantage of the frozen state is that it can be changed with a cmpxchg
> together with some other values (list pointer, counter) that need updating
> at free and allocation.
>
> But frozen updates are rarer, so maybe it's worth it to completely drop the
> frozen bit. If both need to be updated then we would have two atomic ops:
> one is the cmpxchg, and the other the operation on the page flag.

The introduced page flag bit uses non-atomic operations, protected by the
node list_lock.

As for completely dropping the "frozen" bit, I find that hard, because we
have the DEACTIVATE_BYPASS optimization in get_freelist(), which clears
the "frozen" bit without the synchronization of the node list_lock. So
__slab_free() still needs to rely on the "frozen" bit for the CPU active
slab.

This patch series mainly optimizes the cmpxchg cost of moving partial
slabs between the node partial list and the CPU partial list, and
alleviates the contention on the node list_lock meanwhile.

Thanks!
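[Editor's note: for context, a sketch of the get_freelist() fast path being
referred to, simplified from mm/slub.c of this era; the cmpxchg wrapper name
varies across kernel versions, so treat this as illustrative rather than a
verbatim copy.]

	/*
	 * Sketch of get_freelist(): grab the cpu slab's freelist and, if it
	 * came back empty, clear "frozen" in the same cmpxchg -- the
	 * DEACTIVATE_BYPASS referred to above, done without taking the node
	 * list_lock.
	 */
	static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
	{
		struct slab new;
		unsigned long counters;
		void *freelist;

		do {
			freelist = slab->freelist;
			counters = slab->counters;

			new.counters = counters;
			new.inuse = slab->objects;
			/* Keep the slab frozen only if objects remain. */
			new.frozen = freelist != NULL;
		} while (!__cmpxchg_double_slab(s, slab,
			freelist, counters,
			NULL, new.counters,
			"get_freelist"));

		return freelist;
	}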
From: Chengming Zhou <zhouchengming@bytedance.com>

Changes in RFC v2:
- Reuse PG_workingset bit to keep track of whether slub is on the
  per-node partial list, as suggested by Matthew Wilcox.
- Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
  is caused by a leak of partial slabs in get_partial_node().
- Add a patch to simplify acquire_slab().
- Reorder patches a little.
- v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/

1. Problem
==========
Currently we have to freeze the slab when taking it from the node partial
list, and unfreeze the slab when putting it back, because we need to rely
on the node list_lock to synchronize the "frozen" bit changes.

This implementation has some drawbacks:

- Alloc path: twice cmpxchg_double.
  It has to get some partial slabs from the node when the allocator has
  used up the CPU partial slabs. So it freezes the slab (one
  cmpxchg_double) with the node list_lock held, and puts those frozen
  slabs on its CPU partial list. Later, ___slab_alloc() will run the
  cmpxchg_double try-loop again if that slab is picked for use.

- Alloc path: amplified contention on node list_lock.
  Since we have to synchronize the "frozen" bit changes under the node
  list_lock, the contention on the slab (struct page) can be transferred
  to the node list_lock. On a machine with many CPUs in one node, the
  contention on list_lock will be amplified by all CPUs' alloc paths.

  The current code has to work around this problem by avoiding the
  cmpxchg_double try-loop, which just breaks and returns when page
  contention is encountered and the first cmpxchg_double fails. But
  this workaround has its own problem.

- Free path: redundant unfreeze.
  __slab_free() will freeze and cache some slabs on its partial list, and
  flush them to the node partial list when the limit is exceeded, which
  has to unfreeze those slabs again under the node list_lock. Actually we
  don't need to freeze slabs on the CPU partial list, in which case we
  can save the unfreeze cmpxchg_double operations in the flush path.

2. Solution
===========
We solve these problems by leaving slabs unfrozen when they move out of
the node partial list and onto the CPU partial list, so the "frozen" bit
is 0.

These partial slabs won't be manipulated concurrently by the alloc path;
the only racer is the free path, which may manipulate the slab's list
when !inuse. So we need another way to synchronize: we reuse
PG_workingset to keep track of whether the slab is on the node partial
list or not; only in that case may we manipulate the slab's list.

The slab's freezing is delayed until it's picked for active use by the
CPU, at which point it becomes full at the same time; in that case we
still need to rely on the "frozen" bit to avoid manipulating its list.
So the slab is frozen only on activation and unfrozen only on
deactivation.

3. Testing
==========
We just did some simple testing on a server with 128 CPUs (2 nodes) to
compare performance for now.

- perf bench sched messaging -g 5 -t -l 100000
        baseline        RFC
        7.042s          6.966s
        7.022s          7.045s
        7.054s          6.985s

- stress-ng --rawpkt 128 --rawpkt-ops 100000000
        baseline        RFC
        2.42s           2.15s
        2.45s           2.16s
        2.44s           2.17s

The numbers above show about a 10% improvement on the stress-ng rawpkt
testcase, though not much improvement on the perf sched bench testcase.

Thanks for any comment and code review!
Chengming Zhou (6):
  slub: Keep track of whether slub is on the per-node partial list
  slub: Prepare __slab_free() for unfrozen partial slab out of node
    partial list
  slub: Don't freeze slabs for cpu partial
  slub: Simplify acquire_slab()
  slub: Introduce get_cpu_partial()
  slub: Optimize deactivate_slab()

 include/linux/page-flags.h |   2 +
 mm/slab.h                  |  19 +++
 mm/slub.c                  | 245 +++++++++++++++++++------------------
 3 files changed, 150 insertions(+), 116 deletions(-)
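[Editor's note: to make the delayed-freeze step in section 2 concrete: when
___slab_alloc() picks an unfrozen slab (from the cpu partial list, or one
freshly taken off the node partial list), it can freeze it with a single
cmpxchg_double outside the list_lock. A minimal sketch follows; the
freeze_slab() helper name and the cmpxchg wrapper are assumptions along the
lines of later revisions of this series, not the posted patches.]

	/*
	 * Sketch: freeze an unfrozen partial slab when the CPU picks it for
	 * active use, outside the node list_lock. One cmpxchg_double sets
	 * frozen=1 and takes the whole freelist, making the slab "full"
	 * from the freelist's point of view.
	 */
	static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
	{
		struct slab new;
		unsigned long counters;
		void *freelist;

		do {
			freelist = slab->freelist;
			counters = slab->counters;

			new.counters = counters;
			VM_BUG_ON(new.frozen);

			new.inuse = slab->objects;
			new.frozen = 1;
		} while (!cmpxchg_double_slab(s, slab,
			freelist, counters,
			NULL, new.counters,
			"freeze_slab"));

		return freelist;
	}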