Message ID | 20230622085335.77010-25-zhengqi.arch@bytedance.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | use refcount+RCU method to implement lockless slab shrink | expand |
On 6/22/23 10:53, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in > shrinkers subsystem, which protects most operations > such as slab shrink, registration and unregistration > of shrinkers, etc. This can easily cause problems in > the following cases. > > 1) When the memory pressure is high and there are many > filesystems mounted or unmounted at the same time, > slab shrink will be affected (down_read_trylock() > failed). > > Such as the real workload mentioned by Kirill Tkhai: > > ``` > One of the real workloads from my experience is start > of an overcommitted node containing many starting > containers after node crash (or many resuming containers > after reboot for kernel update). In these cases memory > pressure is huge, and the node goes round in long reclaim. > ``` > > 2) If a shrinker is blocked (such as the case mentioned > in [1]) and a writer comes in (such as mount a fs), > then this writer will be blocked and cause all > subsequent shrinker-related operations to be blocked. > > Even if there is no competitor when shrinking slab, there > may still be a problem. If we have a long shrinker list > and we do not reclaim enough memory with each shrinker, > then the down_read_trylock() may be called with high > frequency. Because of the poor multicore scalability of > atomic operations, this can lead to a significant drop > in IPC (instructions per cycle). > > We used to implement the lockless slab shrink with > SRCU [1], but then kernel test robot reported -88.8% > regression in stress-ng.ramfs.ops_per_sec test case [2], > so we reverted it [3]. > > This commit uses the refcount+RCU method [4] proposed by > by Dave Chinner to re-implement the lockless global slab > shrink. The memcg slab shrink is handled in the subsequent > patch. > > Currently, the shrinker instances can be divided into > the following three types: > > a) global shrinker instance statically defined in the kernel, > such as workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel > modules, such as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. > For case b, the memory of shrinker instance will be freed > after the module is unloaded. But we will call synchronize_rcu() > in free_module() to wait for RCU read-side critical section to > exit. For case c, the memory of shrinker instance will be > dynamically freed by calling kfree_rcu(). So we can use > rcu_read_{lock,unlock}() to ensure that the shrinker instance > is valid. > > The shrinker::refcount mechanism ensures that the shrinker > instance will not be run again after unregistration. So the > structure that records the pointer of shrinker instance can be > safely freed without waiting for the RCU read-side critical > section. > > In this way, while we implement the lockless slab shrink, we > don't need to be blocked in unregister_shrinker() to wait > RCU read-side critical section. > > The following are the test results: > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & > > 1) Before applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s > (secs) (secs) (secs) (real time) (usr+sys time) > ramfs 880623 60.02 7.71 226.93 14671.45 3753.09 > ramfs: > 1 System Management Interrupt > for a 60.03s run time: > 5762.40s available CPU time > 7.71s user time ( 0.13%) > 226.93s system time ( 3.94%) > 234.64s total time ( 4.07%) > load average: 8.54 3.06 2.11 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.03s (1 min, 0.03 secs) > > 2) After applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s > (secs) (secs) (secs) (real time) (usr+sys time) > ramfs 847562 60.02 7.44 230.22 14120.66 3566.23 > ramfs: > 4 System Management Interrupts > for a 60.12s run time: > 5771.95s available CPU time > 7.44s user time ( 0.13%) > 230.22s system time ( 3.99%) > 237.66s total time ( 4.12%) > load average: 8.18 2.43 0.84 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.12s (1 min, 0.12 secs) > > We can see that the ops/s has hardly changed. > > [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ > [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ > [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ > [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> > --- > include/linux/shrinker.h | 6 ++++++ > mm/vmscan.c | 33 ++++++++++++++------------------- > 2 files changed, 20 insertions(+), 19 deletions(-) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 7bfeb2f25246..b0c6c2df9db8 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -74,6 +74,7 @@ struct shrinker { > > refcount_t refcount; > struct completion completion_wait; > + struct rcu_head rcu; > > void *private_data; > > @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count, > void shrinker_free(struct shrinker *shrinker); > void unregister_and_free_shrinker(struct shrinker *shrinker); > > +static inline bool shrinker_try_get(struct shrinker *shrinker) > +{ > + return refcount_inc_not_zero(&shrinker->refcount); > +} > + > static inline void shrinker_put(struct shrinker *shrinker) > { > if (refcount_dec_and_test(&shrinker->refcount)) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 6f9c4750effa..767569698946 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -57,6 +57,7 @@ > #include <linux/khugepaged.h> > #include <linux/rculist_nulls.h> > #include <linux/random.h> > +#include <linux/rculist.h> > > #include <asm/tlbflush.h> > #include <asm/div64.h> > @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) > down_write(&shrinker_rwsem); > refcount_set(&shrinker->refcount, 1); > init_completion(&shrinker->completion_wait); > - list_add_tail(&shrinker->list, &shrinker_list); > + list_add_tail_rcu(&shrinker->list, &shrinker_list); > shrinker->flags |= SHRINKER_REGISTERED; > shrinker_debugfs_add(shrinker); > up_write(&shrinker_rwsem); > @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker) > wait_for_completion(&shrinker->completion_wait); > > down_write(&shrinker_rwsem); > - list_del(&shrinker->list); > + list_del_rcu(&shrinker->list); > shrinker->flags &= ~SHRINKER_REGISTERED; > if (shrinker->flags & SHRINKER_MEMCG_AWARE) > unregister_memcg_shrinker(shrinker); > @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free); > void unregister_and_free_shrinker(struct shrinker *shrinker) > { > unregister_shrinker(shrinker); > - kfree(shrinker); > + kfree_rcu(shrinker, rcu); > } > EXPORT_SYMBOL(unregister_and_free_shrinker); > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + rcu_read_unlock(); I don't think you can do this unlock? > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > - /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. > - */ > - if (rwsem_is_contended(&shrinker_rwsem)) { > - freed = freed ? : 1; > - break; > - } > - } > > - up_read(&shrinker_rwsem); > -out: > + rcu_read_lock(); That new rcu_read_lock() won't help AFAIK, the whole list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be safe. IIUC this is why Dave in [4] suggests unifying shrink_slab() with shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. > + shrinker_put(shrinker); > + } > + rcu_read_unlock(); > cond_resched(); > return freed; > }
On 2023/6/22 23:12, Vlastimil Babka wrote: > On 6/22/23 10:53, Qi Zheng wrote: >> The shrinker_rwsem is a global read-write lock in >> shrinkers subsystem, which protects most operations >> such as slab shrink, registration and unregistration >> of shrinkers, etc. This can easily cause problems in >> the following cases. >> >> 1) When the memory pressure is high and there are many >> filesystems mounted or unmounted at the same time, >> slab shrink will be affected (down_read_trylock() >> failed). >> >> Such as the real workload mentioned by Kirill Tkhai: >> >> ``` >> One of the real workloads from my experience is start >> of an overcommitted node containing many starting >> containers after node crash (or many resuming containers >> after reboot for kernel update). In these cases memory >> pressure is huge, and the node goes round in long reclaim. >> ``` >> >> 2) If a shrinker is blocked (such as the case mentioned >> in [1]) and a writer comes in (such as mount a fs), >> then this writer will be blocked and cause all >> subsequent shrinker-related operations to be blocked. >> >> Even if there is no competitor when shrinking slab, there >> may still be a problem. If we have a long shrinker list >> and we do not reclaim enough memory with each shrinker, >> then the down_read_trylock() may be called with high >> frequency. Because of the poor multicore scalability of >> atomic operations, this can lead to a significant drop >> in IPC (instructions per cycle). >> >> We used to implement the lockless slab shrink with >> SRCU [1], but then kernel test robot reported -88.8% >> regression in stress-ng.ramfs.ops_per_sec test case [2], >> so we reverted it [3]. >> >> This commit uses the refcount+RCU method [4] proposed by >> by Dave Chinner to re-implement the lockless global slab >> shrink. The memcg slab shrink is handled in the subsequent >> patch. >> >> Currently, the shrinker instances can be divided into >> the following three types: >> >> a) global shrinker instance statically defined in the kernel, >> such as workingset_shadow_shrinker. >> >> b) global shrinker instance statically defined in the kernel >> modules, such as mmu_shrinker in x86. >> >> c) shrinker instance embedded in other structures. >> >> For case a, the memory of shrinker instance is never freed. >> For case b, the memory of shrinker instance will be freed >> after the module is unloaded. But we will call synchronize_rcu() >> in free_module() to wait for RCU read-side critical section to >> exit. For case c, the memory of shrinker instance will be >> dynamically freed by calling kfree_rcu(). So we can use >> rcu_read_{lock,unlock}() to ensure that the shrinker instance >> is valid. >> >> The shrinker::refcount mechanism ensures that the shrinker >> instance will not be run again after unregistration. So the >> structure that records the pointer of shrinker instance can be >> safely freed without waiting for the RCU read-side critical >> section. >> >> In this way, while we implement the lockless slab shrink, we >> don't need to be blocked in unregister_shrinker() to wait >> RCU read-side critical section. >> >> The following are the test results: >> >> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & >> >> 1) Before applying this patchset: >> >> setting to a 60 second run per stressor >> dispatching hogs: 9 ramfs >> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >> (secs) (secs) (secs) (real time) (usr+sys time) >> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09 >> ramfs: >> 1 System Management Interrupt >> for a 60.03s run time: >> 5762.40s available CPU time >> 7.71s user time ( 0.13%) >> 226.93s system time ( 3.94%) >> 234.64s total time ( 4.07%) >> load average: 8.54 3.06 2.11 >> passed: 9: ramfs (9) >> failed: 0 >> skipped: 0 >> successful run completed in 60.03s (1 min, 0.03 secs) >> >> 2) After applying this patchset: >> >> setting to a 60 second run per stressor >> dispatching hogs: 9 ramfs >> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >> (secs) (secs) (secs) (real time) (usr+sys time) >> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23 >> ramfs: >> 4 System Management Interrupts >> for a 60.12s run time: >> 5771.95s available CPU time >> 7.44s user time ( 0.13%) >> 230.22s system time ( 3.99%) >> 237.66s total time ( 4.12%) >> load average: 8.18 2.43 0.84 >> passed: 9: ramfs (9) >> failed: 0 >> skipped: 0 >> successful run completed in 60.12s (1 min, 0.12 secs) >> >> We can see that the ops/s has hardly changed. >> >> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ >> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ >> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ >> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ >> >> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> >> --- >> include/linux/shrinker.h | 6 ++++++ >> mm/vmscan.c | 33 ++++++++++++++------------------- >> 2 files changed, 20 insertions(+), 19 deletions(-) >> >> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h >> index 7bfeb2f25246..b0c6c2df9db8 100644 >> --- a/include/linux/shrinker.h >> +++ b/include/linux/shrinker.h >> @@ -74,6 +74,7 @@ struct shrinker { >> >> refcount_t refcount; >> struct completion completion_wait; >> + struct rcu_head rcu; >> >> void *private_data; >> >> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count, >> void shrinker_free(struct shrinker *shrinker); >> void unregister_and_free_shrinker(struct shrinker *shrinker); >> >> +static inline bool shrinker_try_get(struct shrinker *shrinker) >> +{ >> + return refcount_inc_not_zero(&shrinker->refcount); >> +} >> + >> static inline void shrinker_put(struct shrinker *shrinker) >> { >> if (refcount_dec_and_test(&shrinker->refcount)) >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 6f9c4750effa..767569698946 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -57,6 +57,7 @@ >> #include <linux/khugepaged.h> >> #include <linux/rculist_nulls.h> >> #include <linux/random.h> >> +#include <linux/rculist.h> >> >> #include <asm/tlbflush.h> >> #include <asm/div64.h> >> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) >> down_write(&shrinker_rwsem); >> refcount_set(&shrinker->refcount, 1); >> init_completion(&shrinker->completion_wait); >> - list_add_tail(&shrinker->list, &shrinker_list); >> + list_add_tail_rcu(&shrinker->list, &shrinker_list); >> shrinker->flags |= SHRINKER_REGISTERED; >> shrinker_debugfs_add(shrinker); >> up_write(&shrinker_rwsem); >> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker) >> wait_for_completion(&shrinker->completion_wait); >> >> down_write(&shrinker_rwsem); >> - list_del(&shrinker->list); >> + list_del_rcu(&shrinker->list); >> shrinker->flags &= ~SHRINKER_REGISTERED; >> if (shrinker->flags & SHRINKER_MEMCG_AWARE) >> unregister_memcg_shrinker(shrinker); >> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free); >> void unregister_and_free_shrinker(struct shrinker *shrinker) >> { >> unregister_shrinker(shrinker); >> - kfree(shrinker); >> + kfree_rcu(shrinker, rcu); >> } >> EXPORT_SYMBOL(unregister_and_free_shrinker); >> >> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, >> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >> >> - if (!down_read_trylock(&shrinker_rwsem)) >> - goto out; >> - >> - list_for_each_entry(shrinker, &shrinker_list, list) { >> + rcu_read_lock(); >> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >> struct shrink_control sc = { >> .gfp_mask = gfp_mask, >> .nid = nid, >> .memcg = memcg, >> }; >> >> + if (!shrinker_try_get(shrinker)) >> + continue; >> + rcu_read_unlock(); > > I don't think you can do this unlock? > >> + >> ret = do_shrink_slab(&sc, shrinker, priority); >> if (ret == SHRINK_EMPTY) >> ret = 0; >> freed += ret; >> - /* >> - * Bail out if someone want to register a new shrinker to >> - * prevent the registration from being stalled for long periods >> - * by parallel ongoing shrinking. >> - */ >> - if (rwsem_is_contended(&shrinker_rwsem)) { >> - freed = freed ? : 1; >> - break; >> - } >> - } >> >> - up_read(&shrinker_rwsem); >> -out: >> + rcu_read_lock(); > > That new rcu_read_lock() won't help AFAIK, the whole > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be > safe. In the unregister_shrinker() path, we will wait for the refcount to zero before deleting the shrinker from the linked list. Here, we first took the rcu lock, and then decrement the refcount of this shrinker. shrink_slab unregister_shrinker =========== =================== /* wait for B */ wait_for_completion() rcu_read_lock() shrinker_put() --> (B) list_del_rcu() /* wait for rcu_read_unlock() */ kfree_rcu() /* * so this shrinker will not be freed here, * and can be used to traverse the next node * normally? */ list_for_each_entry() shrinker_try_get() rcu_read_unlock() Did I miss something? > > IIUC this is why Dave in [4] suggests unifying shrink_slab() with > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. > >> + shrinker_put(shrinker); >> + } >> + rcu_read_unlock(); >> cond_resched(); >> return freed; >> } >
> 2023年6月23日 上午12:42,Qi Zheng <zhengqi.arch@bytedance.com> 写道: > > > > On 2023/6/22 23:12, Vlastimil Babka wrote: >> On 6/22/23 10:53, Qi Zheng wrote: >>> The shrinker_rwsem is a global read-write lock in >>> shrinkers subsystem, which protects most operations >>> such as slab shrink, registration and unregistration >>> of shrinkers, etc. This can easily cause problems in >>> the following cases. >>> >>> 1) When the memory pressure is high and there are many >>> filesystems mounted or unmounted at the same time, >>> slab shrink will be affected (down_read_trylock() >>> failed). >>> >>> Such as the real workload mentioned by Kirill Tkhai: >>> >>> ``` >>> One of the real workloads from my experience is start >>> of an overcommitted node containing many starting >>> containers after node crash (or many resuming containers >>> after reboot for kernel update). In these cases memory >>> pressure is huge, and the node goes round in long reclaim. >>> ``` >>> >>> 2) If a shrinker is blocked (such as the case mentioned >>> in [1]) and a writer comes in (such as mount a fs), >>> then this writer will be blocked and cause all >>> subsequent shrinker-related operations to be blocked. >>> >>> Even if there is no competitor when shrinking slab, there >>> may still be a problem. If we have a long shrinker list >>> and we do not reclaim enough memory with each shrinker, >>> then the down_read_trylock() may be called with high >>> frequency. Because of the poor multicore scalability of >>> atomic operations, this can lead to a significant drop >>> in IPC (instructions per cycle). >>> >>> We used to implement the lockless slab shrink with >>> SRCU [1], but then kernel test robot reported -88.8% >>> regression in stress-ng.ramfs.ops_per_sec test case [2], >>> so we reverted it [3]. >>> >>> This commit uses the refcount+RCU method [4] proposed by >>> by Dave Chinner to re-implement the lockless global slab >>> shrink. The memcg slab shrink is handled in the subsequent >>> patch. >>> >>> Currently, the shrinker instances can be divided into >>> the following three types: >>> >>> a) global shrinker instance statically defined in the kernel, >>> such as workingset_shadow_shrinker. >>> >>> b) global shrinker instance statically defined in the kernel >>> modules, such as mmu_shrinker in x86. >>> >>> c) shrinker instance embedded in other structures. >>> >>> For case a, the memory of shrinker instance is never freed. >>> For case b, the memory of shrinker instance will be freed >>> after the module is unloaded. But we will call synchronize_rcu() >>> in free_module() to wait for RCU read-side critical section to >>> exit. For case c, the memory of shrinker instance will be >>> dynamically freed by calling kfree_rcu(). So we can use >>> rcu_read_{lock,unlock}() to ensure that the shrinker instance >>> is valid. >>> >>> The shrinker::refcount mechanism ensures that the shrinker >>> instance will not be run again after unregistration. So the >>> structure that records the pointer of shrinker instance can be >>> safely freed without waiting for the RCU read-side critical >>> section. >>> >>> In this way, while we implement the lockless slab shrink, we >>> don't need to be blocked in unregister_shrinker() to wait >>> RCU read-side critical section. >>> >>> The following are the test results: >>> >>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & >>> >>> 1) Before applying this patchset: >>> >>> setting to a 60 second run per stressor >>> dispatching hogs: 9 ramfs >>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >>> (secs) (secs) (secs) (real time) (usr+sys time) >>> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09 >>> ramfs: >>> 1 System Management Interrupt >>> for a 60.03s run time: >>> 5762.40s available CPU time >>> 7.71s user time ( 0.13%) >>> 226.93s system time ( 3.94%) >>> 234.64s total time ( 4.07%) >>> load average: 8.54 3.06 2.11 >>> passed: 9: ramfs (9) >>> failed: 0 >>> skipped: 0 >>> successful run completed in 60.03s (1 min, 0.03 secs) >>> >>> 2) After applying this patchset: >>> >>> setting to a 60 second run per stressor >>> dispatching hogs: 9 ramfs >>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >>> (secs) (secs) (secs) (real time) (usr+sys time) >>> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23 >>> ramfs: >>> 4 System Management Interrupts >>> for a 60.12s run time: >>> 5771.95s available CPU time >>> 7.44s user time ( 0.13%) >>> 230.22s system time ( 3.99%) >>> 237.66s total time ( 4.12%) >>> load average: 8.18 2.43 0.84 >>> passed: 9: ramfs (9) >>> failed: 0 >>> skipped: 0 >>> successful run completed in 60.12s (1 min, 0.12 secs) >>> >>> We can see that the ops/s has hardly changed. >>> >>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ >>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ >>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ >>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ >>> >>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> >>> --- >>> include/linux/shrinker.h | 6 ++++++ >>> mm/vmscan.c | 33 ++++++++++++++------------------- >>> 2 files changed, 20 insertions(+), 19 deletions(-) >>> >>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h >>> index 7bfeb2f25246..b0c6c2df9db8 100644 >>> --- a/include/linux/shrinker.h >>> +++ b/include/linux/shrinker.h >>> @@ -74,6 +74,7 @@ struct shrinker { >>> refcount_t refcount; >>> struct completion completion_wait; >>> + struct rcu_head rcu; >>> void *private_data; >>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count, >>> void shrinker_free(struct shrinker *shrinker); >>> void unregister_and_free_shrinker(struct shrinker *shrinker); >>> +static inline bool shrinker_try_get(struct shrinker *shrinker) >>> +{ >>> + return refcount_inc_not_zero(&shrinker->refcount); >>> +} >>> + >>> static inline void shrinker_put(struct shrinker *shrinker) >>> { >>> if (refcount_dec_and_test(&shrinker->refcount)) >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index 6f9c4750effa..767569698946 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -57,6 +57,7 @@ >>> #include <linux/khugepaged.h> >>> #include <linux/rculist_nulls.h> >>> #include <linux/random.h> >>> +#include <linux/rculist.h> >>> #include <asm/tlbflush.h> >>> #include <asm/div64.h> >>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) >>> down_write(&shrinker_rwsem); >>> refcount_set(&shrinker->refcount, 1); >>> init_completion(&shrinker->completion_wait); >>> - list_add_tail(&shrinker->list, &shrinker_list); >>> + list_add_tail_rcu(&shrinker->list, &shrinker_list); >>> shrinker->flags |= SHRINKER_REGISTERED; >>> shrinker_debugfs_add(shrinker); >>> up_write(&shrinker_rwsem); >>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker) >>> wait_for_completion(&shrinker->completion_wait); >>> down_write(&shrinker_rwsem); >>> - list_del(&shrinker->list); >>> + list_del_rcu(&shrinker->list); >>> shrinker->flags &= ~SHRINKER_REGISTERED; >>> if (shrinker->flags & SHRINKER_MEMCG_AWARE) >>> unregister_memcg_shrinker(shrinker); >>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free); >>> void unregister_and_free_shrinker(struct shrinker *shrinker) >>> { >>> unregister_shrinker(shrinker); >>> - kfree(shrinker); >>> + kfree_rcu(shrinker, rcu); >>> } >>> EXPORT_SYMBOL(unregister_and_free_shrinker); >>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, >>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >>> - if (!down_read_trylock(&shrinker_rwsem)) >>> - goto out; >>> - >>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>> + rcu_read_lock(); >>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>> struct shrink_control sc = { >>> .gfp_mask = gfp_mask, >>> .nid = nid, >>> .memcg = memcg, >>> }; >>> + if (!shrinker_try_get(shrinker)) >>> + continue; >>> + rcu_read_unlock(); >> I don't think you can do this unlock? >>> + >>> ret = do_shrink_slab(&sc, shrinker, priority); >>> if (ret == SHRINK_EMPTY) >>> ret = 0; >>> freed += ret; >>> - /* >>> - * Bail out if someone want to register a new shrinker to >>> - * prevent the registration from being stalled for long periods >>> - * by parallel ongoing shrinking. >>> - */ >>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>> - freed = freed ? : 1; >>> - break; >>> - } >>> - } >>> - up_read(&shrinker_rwsem); >>> -out: >>> + rcu_read_lock(); >> That new rcu_read_lock() won't help AFAIK, the whole >> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be >> safe. > > In the unregister_shrinker() path, we will wait for the refcount to zero > before deleting the shrinker from the linked list. Here, we first took > the rcu lock, and then decrement the refcount of this shrinker. > > shrink_slab unregister_shrinker > =========== =================== > > /* wait for B */ > wait_for_completion() > rcu_read_lock() > > shrinker_put() --> (B) > list_del_rcu() > /* wait for rcu_read_unlock() */ > kfree_rcu() > > /* > * so this shrinker will not be freed here, > * and can be used to traverse the next node > * normally? > */ > list_for_each_entry() > > shrinker_try_get() > rcu_read_unlock() > > Did I miss something? After calling rcu_read_unlock(), the next shrinker in the list can be freed, so in the next iteration, use after free might happen? Is that right? > >> IIUC this is why Dave in [4] suggests unifying shrink_slab() with >> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. >>> + shrinker_put(shrinker); >>> + } >>> + rcu_read_unlock(); >>> cond_resched(); >>> return freed; >>> }
On 2023/6/23 01:41, Alan Huang wrote: > >> 2023年6月23日 上午12:42,Qi Zheng <zhengqi.arch@bytedance.com> 写道: >> >> >> >> On 2023/6/22 23:12, Vlastimil Babka wrote: >>> On 6/22/23 10:53, Qi Zheng wrote: >>>> The shrinker_rwsem is a global read-write lock in >>>> shrinkers subsystem, which protects most operations >>>> such as slab shrink, registration and unregistration >>>> of shrinkers, etc. This can easily cause problems in >>>> the following cases. >>>> >>>> 1) When the memory pressure is high and there are many >>>> filesystems mounted or unmounted at the same time, >>>> slab shrink will be affected (down_read_trylock() >>>> failed). >>>> >>>> Such as the real workload mentioned by Kirill Tkhai: >>>> >>>> ``` >>>> One of the real workloads from my experience is start >>>> of an overcommitted node containing many starting >>>> containers after node crash (or many resuming containers >>>> after reboot for kernel update). In these cases memory >>>> pressure is huge, and the node goes round in long reclaim. >>>> ``` >>>> >>>> 2) If a shrinker is blocked (such as the case mentioned >>>> in [1]) and a writer comes in (such as mount a fs), >>>> then this writer will be blocked and cause all >>>> subsequent shrinker-related operations to be blocked. >>>> >>>> Even if there is no competitor when shrinking slab, there >>>> may still be a problem. If we have a long shrinker list >>>> and we do not reclaim enough memory with each shrinker, >>>> then the down_read_trylock() may be called with high >>>> frequency. Because of the poor multicore scalability of >>>> atomic operations, this can lead to a significant drop >>>> in IPC (instructions per cycle). >>>> >>>> We used to implement the lockless slab shrink with >>>> SRCU [1], but then kernel test robot reported -88.8% >>>> regression in stress-ng.ramfs.ops_per_sec test case [2], >>>> so we reverted it [3]. >>>> >>>> This commit uses the refcount+RCU method [4] proposed by >>>> by Dave Chinner to re-implement the lockless global slab >>>> shrink. The memcg slab shrink is handled in the subsequent >>>> patch. >>>> >>>> Currently, the shrinker instances can be divided into >>>> the following three types: >>>> >>>> a) global shrinker instance statically defined in the kernel, >>>> such as workingset_shadow_shrinker. >>>> >>>> b) global shrinker instance statically defined in the kernel >>>> modules, such as mmu_shrinker in x86. >>>> >>>> c) shrinker instance embedded in other structures. >>>> >>>> For case a, the memory of shrinker instance is never freed. >>>> For case b, the memory of shrinker instance will be freed >>>> after the module is unloaded. But we will call synchronize_rcu() >>>> in free_module() to wait for RCU read-side critical section to >>>> exit. For case c, the memory of shrinker instance will be >>>> dynamically freed by calling kfree_rcu(). So we can use >>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance >>>> is valid. >>>> >>>> The shrinker::refcount mechanism ensures that the shrinker >>>> instance will not be run again after unregistration. So the >>>> structure that records the pointer of shrinker instance can be >>>> safely freed without waiting for the RCU read-side critical >>>> section. >>>> >>>> In this way, while we implement the lockless slab shrink, we >>>> don't need to be blocked in unregister_shrinker() to wait >>>> RCU read-side critical section. >>>> >>>> The following are the test results: >>>> >>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & >>>> >>>> 1) Before applying this patchset: >>>> >>>> setting to a 60 second run per stressor >>>> dispatching hogs: 9 ramfs >>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >>>> (secs) (secs) (secs) (real time) (usr+sys time) >>>> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09 >>>> ramfs: >>>> 1 System Management Interrupt >>>> for a 60.03s run time: >>>> 5762.40s available CPU time >>>> 7.71s user time ( 0.13%) >>>> 226.93s system time ( 3.94%) >>>> 234.64s total time ( 4.07%) >>>> load average: 8.54 3.06 2.11 >>>> passed: 9: ramfs (9) >>>> failed: 0 >>>> skipped: 0 >>>> successful run completed in 60.03s (1 min, 0.03 secs) >>>> >>>> 2) After applying this patchset: >>>> >>>> setting to a 60 second run per stressor >>>> dispatching hogs: 9 ramfs >>>> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s >>>> (secs) (secs) (secs) (real time) (usr+sys time) >>>> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23 >>>> ramfs: >>>> 4 System Management Interrupts >>>> for a 60.12s run time: >>>> 5771.95s available CPU time >>>> 7.44s user time ( 0.13%) >>>> 230.22s system time ( 3.99%) >>>> 237.66s total time ( 4.12%) >>>> load average: 8.18 2.43 0.84 >>>> passed: 9: ramfs (9) >>>> failed: 0 >>>> skipped: 0 >>>> successful run completed in 60.12s (1 min, 0.12 secs) >>>> >>>> We can see that the ops/s has hardly changed. >>>> >>>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ >>>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ >>>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ >>>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ >>>> >>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> >>>> --- >>>> include/linux/shrinker.h | 6 ++++++ >>>> mm/vmscan.c | 33 ++++++++++++++------------------- >>>> 2 files changed, 20 insertions(+), 19 deletions(-) >>>> >>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h >>>> index 7bfeb2f25246..b0c6c2df9db8 100644 >>>> --- a/include/linux/shrinker.h >>>> +++ b/include/linux/shrinker.h >>>> @@ -74,6 +74,7 @@ struct shrinker { >>>> refcount_t refcount; >>>> struct completion completion_wait; >>>> + struct rcu_head rcu; >>>> void *private_data; >>>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count, >>>> void shrinker_free(struct shrinker *shrinker); >>>> void unregister_and_free_shrinker(struct shrinker *shrinker); >>>> +static inline bool shrinker_try_get(struct shrinker *shrinker) >>>> +{ >>>> + return refcount_inc_not_zero(&shrinker->refcount); >>>> +} >>>> + >>>> static inline void shrinker_put(struct shrinker *shrinker) >>>> { >>>> if (refcount_dec_and_test(&shrinker->refcount)) >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>>> index 6f9c4750effa..767569698946 100644 >>>> --- a/mm/vmscan.c >>>> +++ b/mm/vmscan.c >>>> @@ -57,6 +57,7 @@ >>>> #include <linux/khugepaged.h> >>>> #include <linux/rculist_nulls.h> >>>> #include <linux/random.h> >>>> +#include <linux/rculist.h> >>>> #include <asm/tlbflush.h> >>>> #include <asm/div64.h> >>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) >>>> down_write(&shrinker_rwsem); >>>> refcount_set(&shrinker->refcount, 1); >>>> init_completion(&shrinker->completion_wait); >>>> - list_add_tail(&shrinker->list, &shrinker_list); >>>> + list_add_tail_rcu(&shrinker->list, &shrinker_list); >>>> shrinker->flags |= SHRINKER_REGISTERED; >>>> shrinker_debugfs_add(shrinker); >>>> up_write(&shrinker_rwsem); >>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker) >>>> wait_for_completion(&shrinker->completion_wait); >>>> down_write(&shrinker_rwsem); >>>> - list_del(&shrinker->list); >>>> + list_del_rcu(&shrinker->list); >>>> shrinker->flags &= ~SHRINKER_REGISTERED; >>>> if (shrinker->flags & SHRINKER_MEMCG_AWARE) >>>> unregister_memcg_shrinker(shrinker); >>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free); >>>> void unregister_and_free_shrinker(struct shrinker *shrinker) >>>> { >>>> unregister_shrinker(shrinker); >>>> - kfree(shrinker); >>>> + kfree_rcu(shrinker, rcu); >>>> } >>>> EXPORT_SYMBOL(unregister_and_free_shrinker); >>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, >>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >>>> - if (!down_read_trylock(&shrinker_rwsem)) >>>> - goto out; >>>> - >>>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>>> + rcu_read_lock(); >>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>>> struct shrink_control sc = { >>>> .gfp_mask = gfp_mask, >>>> .nid = nid, >>>> .memcg = memcg, >>>> }; >>>> + if (!shrinker_try_get(shrinker)) >>>> + continue; >>>> + rcu_read_unlock(); >>> I don't think you can do this unlock? >>>> + >>>> ret = do_shrink_slab(&sc, shrinker, priority); >>>> if (ret == SHRINK_EMPTY) >>>> ret = 0; >>>> freed += ret; >>>> - /* >>>> - * Bail out if someone want to register a new shrinker to >>>> - * prevent the registration from being stalled for long periods >>>> - * by parallel ongoing shrinking. >>>> - */ >>>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>>> - freed = freed ? : 1; >>>> - break; >>>> - } >>>> - } >>>> - up_read(&shrinker_rwsem); >>>> -out: >>>> + rcu_read_lock(); >>> That new rcu_read_lock() won't help AFAIK, the whole >>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be >>> safe. >> >> In the unregister_shrinker() path, we will wait for the refcount to zero >> before deleting the shrinker from the linked list. Here, we first took >> the rcu lock, and then decrement the refcount of this shrinker. >> >> shrink_slab unregister_shrinker >> =========== =================== >> >> /* wait for B */ >> wait_for_completion() >> rcu_read_lock() >> >> shrinker_put() --> (B) >> list_del_rcu() >> /* wait for rcu_read_unlock() */ >> kfree_rcu() >> >> /* >> * so this shrinker will not be freed here, >> * and can be used to traverse the next node >> * normally? >> */ >> list_for_each_entry() >> >> shrinker_try_get() >> rcu_read_unlock() >> >> Did I miss something? > > After calling rcu_read_unlock(), the next shrinker in the list can be freed, > so in the next iteration, use after free might happen? > > Is that right? IIUC, are you talking about the following scenario? shrink_slab unregister_shrinker a =========== ===================== rcu_read_unlock() /* next *shrinker b* was * removed from shrinker_list. */ /* wait for B */ wait_for_completion() rcu_read_lock() shrinker_put() --> (B) list_del_rcu() /* wait for rcu_read_unlock() */ kfree_rcu() list_for_each_entry() shrinker_try_get() rcu_read_unlock() When the next *shrinker b* is deleted, the *shrinker a* has not been deleted from the shrinker_list, so it will point a->next to b->next. Then in the next iteration, we will get the b->next instead of b? > >> >>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with >>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. >>>> + shrinker_put(shrinker); >>>> + } >>>> + rcu_read_unlock(); >>>> cond_resched(); >>>> return freed; >>>> } >
On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: > On 6/22/23 10:53, Qi Zheng wrote: > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > - goto out; > > - > > - list_for_each_entry(shrinker, &shrinker_list, list) { > > + rcu_read_lock(); > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > > struct shrink_control sc = { > > .gfp_mask = gfp_mask, > > .nid = nid, > > .memcg = memcg, > > }; > > > > + if (!shrinker_try_get(shrinker)) > > + continue; > > + rcu_read_unlock(); > > I don't think you can do this unlock? > > > + > > ret = do_shrink_slab(&sc, shrinker, priority); > > if (ret == SHRINK_EMPTY) > > ret = 0; > > freed += ret; > > - /* > > - * Bail out if someone want to register a new shrinker to > > - * prevent the registration from being stalled for long periods > > - * by parallel ongoing shrinking. > > - */ > > - if (rwsem_is_contended(&shrinker_rwsem)) { > > - freed = freed ? : 1; > > - break; > > - } > > - } > > > > - up_read(&shrinker_rwsem); > > -out: > > + rcu_read_lock(); > > That new rcu_read_lock() won't help AFAIK, the whole > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be > safe. Yeah, that's the pattern we've been taught and the one we can look at and immediately say "this is safe". This is a different pattern, as has been explained bi Qi, and I think it *might* be safe. *However.* Right now I don't have time to go through a novel RCU list iteration pattern it one step at to determine the correctness of the algorithm. I'm mostly worried about list manipulations that can occur outside rcu_read_lock() section bleeding into the RCU critical section because rcu_read_lock() by itself is not a memory barrier. Maybe Paul has seen this pattern often enough he could simply tell us what conditions it is safe in. But for me to work that out from first principles? I just don't have the time to do that right now. > IIUC this is why Dave in [4] suggests unifying shrink_slab() with > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. Yes, I suggested the IDR route because radix tree lookups under RCU with reference counted objects are a known safe pattern that we can easily confirm is correct or not. Hence I suggested the unification + IDR route because it makes the life of reviewers so, so much easier... Cheers, Dave.
On 2023/6/23 14:29, Dave Chinner wrote: > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >> On 6/22/23 10:53, Qi Zheng wrote: >>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, >>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >>> >>> - if (!down_read_trylock(&shrinker_rwsem)) >>> - goto out; >>> - >>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>> + rcu_read_lock(); >>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>> struct shrink_control sc = { >>> .gfp_mask = gfp_mask, >>> .nid = nid, >>> .memcg = memcg, >>> }; >>> >>> + if (!shrinker_try_get(shrinker)) >>> + continue; >>> + rcu_read_unlock(); >> >> I don't think you can do this unlock? >> >>> + >>> ret = do_shrink_slab(&sc, shrinker, priority); >>> if (ret == SHRINK_EMPTY) >>> ret = 0; >>> freed += ret; >>> - /* >>> - * Bail out if someone want to register a new shrinker to >>> - * prevent the registration from being stalled for long periods >>> - * by parallel ongoing shrinking. >>> - */ >>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>> - freed = freed ? : 1; >>> - break; >>> - } >>> - } >>> >>> - up_read(&shrinker_rwsem); >>> -out: >>> + rcu_read_lock(); >> >> That new rcu_read_lock() won't help AFAIK, the whole >> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be >> safe. > > Yeah, that's the pattern we've been taught and the one we can look > at and immediately say "this is safe". > > This is a different pattern, as has been explained bi Qi, and I > think it *might* be safe. > > *However.* > > Right now I don't have time to go through a novel RCU list iteration > pattern it one step at to determine the correctness of the > algorithm. I'm mostly worried about list manipulations that can > occur outside rcu_read_lock() section bleeding into the RCU > critical section because rcu_read_lock() by itself is not a memory > barrier. > > Maybe Paul has seen this pattern often enough he could simply tell > us what conditions it is safe in. But for me to work that out from > first principles? I just don't have the time to do that right now. Hi Paul, can you help to confirm this? > >> IIUC this is why Dave in [4] suggests unifying shrink_slab() with >> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. > > Yes, I suggested the IDR route because radix tree lookups under RCU > with reference counted objects are a known safe pattern that we can > easily confirm is correct or not. Hence I suggested the unification > + IDR route because it makes the life of reviewers so, so much > easier... In fact, I originally planned to try the unification + IDR method you suggested at the beginning. But in the case of CONFIG_MEMCG disabled, the struct mem_cgroup is not even defined, and root_mem_cgroup and shrinker_info will not be allocated. This required more code changes, so I ended up keeping the shrinker_list and implementing the above pattern. If the above pattern is not safe, I will go back to the unification + IDR method. Thanks, Qi > > Cheers, > > Dave.
On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote: > On 2023/6/23 14:29, Dave Chinner wrote: > > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: > > > On 6/22/23 10:53, Qi Zheng wrote: > > Yes, I suggested the IDR route because radix tree lookups under RCU > > with reference counted objects are a known safe pattern that we can > > easily confirm is correct or not. Hence I suggested the unification > > + IDR route because it makes the life of reviewers so, so much > > easier... > > In fact, I originally planned to try the unification + IDR method you > suggested at the beginning. But in the case of CONFIG_MEMCG disabled, > the struct mem_cgroup is not even defined, and root_mem_cgroup and > shrinker_info will not be allocated. This required more code changes, so > I ended up keeping the shrinker_list and implementing the above pattern. Yes. Go back and read what I originally said needed to be done first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs to exist that holds all of the global shrinkers. Then shrink_slab() is only ever passed a memcg that should be iterated. Yes, it needs changes external to the shrinker code itself to be made to work. And even if memcg's are not enabled, we can still use the memcg structures to ensure a common abstraction is used for the shrinker tracking infrastructure.... > If the above pattern is not safe, I will go back to the unification + > IDR method. And that is exactly how we got into this mess in the first place.... -Dave
Hi Dave, On 2023/6/24 06:19, Dave Chinner wrote: > On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote: >> On 2023/6/23 14:29, Dave Chinner wrote: >>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >>>> On 6/22/23 10:53, Qi Zheng wrote: >>> Yes, I suggested the IDR route because radix tree lookups under RCU >>> with reference counted objects are a known safe pattern that we can >>> easily confirm is correct or not. Hence I suggested the unification >>> + IDR route because it makes the life of reviewers so, so much >>> easier... >> >> In fact, I originally planned to try the unification + IDR method you >> suggested at the beginning. But in the case of CONFIG_MEMCG disabled, >> the struct mem_cgroup is not even defined, and root_mem_cgroup and >> shrinker_info will not be allocated. This required more code changes, so >> I ended up keeping the shrinker_list and implementing the above pattern. > > Yes. Go back and read what I originally said needed to be done > first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs > to exist that holds all of the global shrinkers. Then shrink_slab() > is only ever passed a memcg that should be iterated. > > Yes, it needs changes external to the shrinker code itself to be > made to work. And even if memcg's are not enabled, we can still use > the memcg structures to ensure a common abstraction is used for the > shrinker tracking infrastructure.... Yeah, what I imagined before was to define a more concise struct mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root memcg on system boot: #ifdef !CONFIG_MEMCG struct shrinker_info { struct rcu_head rcu; atomic_long_t *nr_deferred; unsigned long *map; int map_nr_max; }; struct mem_cgroup_per_node { struct shrinker_info __rcu *shrinker_info; }; struct mem_cgroup { struct mem_cgroup_per_node *nodeinfo[]; }; #endif But I have a concern: if all global shrinkers are tracking with the info->map of root memcg, a shrinker->id needs to be assigned to them, which will cause info->map_nr_max to become larger than before, then making the traversal of info->map slower. > >> If the above pattern is not safe, I will go back to the unification + >> IDR method. > > And that is exactly how we got into this mess in the first place.... I only found one similar pattern in the kernel: fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table But IIUC, the refcount here needs to be decremented after holding rcu lock as I did above. So regardless of whether we choose unification + IDR in the end, I still want to confirm whether the pattern I implemented above is safe. :) Thanks, Qi > > -Dave
On 2023/6/24 19:08, Qi Zheng wrote: > Hi Dave, > > On 2023/6/24 06:19, Dave Chinner wrote: >> On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote: >>> On 2023/6/23 14:29, Dave Chinner wrote: >>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >>>>> On 6/22/23 10:53, Qi Zheng wrote: >>>> Yes, I suggested the IDR route because radix tree lookups under RCU >>>> with reference counted objects are a known safe pattern that we can >>>> easily confirm is correct or not. Hence I suggested the unification >>>> + IDR route because it makes the life of reviewers so, so much >>>> easier... >>> >>> In fact, I originally planned to try the unification + IDR method you >>> suggested at the beginning. But in the case of CONFIG_MEMCG disabled, >>> the struct mem_cgroup is not even defined, and root_mem_cgroup and >>> shrinker_info will not be allocated. This required more code >>> changes, so >>> I ended up keeping the shrinker_list and implementing the above pattern. >> >> Yes. Go back and read what I originally said needed to be done >> first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs >> to exist that holds all of the global shrinkers. Then shrink_slab() >> is only ever passed a memcg that should be iterated. >> >> Yes, it needs changes external to the shrinker code itself to be >> made to work. And even if memcg's are not enabled, we can still use >> the memcg structures to ensure a common abstraction is used for the >> shrinker tracking infrastructure.... > > Yeah, what I imagined before was to define a more concise struct > mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root > memcg on system boot: > > #ifdef !CONFIG_MEMCG > > struct shrinker_info { > struct rcu_head rcu; > atomic_long_t *nr_deferred; > unsigned long *map; > int map_nr_max; > }; > > struct mem_cgroup_per_node { > struct shrinker_info __rcu *shrinker_info; > }; > > struct mem_cgroup { > struct mem_cgroup_per_node *nodeinfo[]; > }; > > #endif > > But I have a concern: if all global shrinkers are tracking with the > info->map of root memcg, a shrinker->id needs to be assigned to them, > which will cause info->map_nr_max to become larger than before, then > making the traversal of info->map slower. But most of the system is 'sb-xxx' shrinker instances, they all have the SHRINKER_MEMCG_AWARE flag, so it should have little impact on the speed of traversing info->map. ;) > >> >>> If the above pattern is not safe, I will go back to the unification + >>> IDR method. >> >> And that is exactly how we got into this mess in the first place.... > > I only found one similar pattern in the kernel: > > fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table > > But IIUC, the refcount here needs to be decremented after holding > rcu lock as I did above. > > So regardless of whether we choose unification + IDR in the end, I still > want to confirm whether the pattern I implemented above is safe. :) Also + RCU mailing list. > > Thanks, > Qi > >> >> -Dave
On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote: > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: > > On 6/22/23 10:53, Qi Zheng wrote: > > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > > > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) > > > return shrink_slab_memcg(gfp_mask, nid, memcg, priority); > > > > > > - if (!down_read_trylock(&shrinker_rwsem)) > > > - goto out; > > > - > > > - list_for_each_entry(shrinker, &shrinker_list, list) { > > > + rcu_read_lock(); > > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > > > struct shrink_control sc = { > > > .gfp_mask = gfp_mask, > > > .nid = nid, > > > .memcg = memcg, > > > }; > > > > > > + if (!shrinker_try_get(shrinker)) > > > + continue; > > > + rcu_read_unlock(); > > > > I don't think you can do this unlock? Sorry to be slow to respond here, this one fell through the cracks. And thank you to Qi for reminding me! If you do this unlock, you had jolly well better nail down the current element (the one referenced by shrinker), for example, by acquiring an explicit reference count on the object. And presumably this is exactly what shrinker_try_get() is doing. And a look at your 24/29 confirms this, at least assuming that shrinker->refcount is set to zero before the call to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't start until *after* shrinker_put() calls complete(). Plus, as always, the object must be removed from the list before the synchronize_rcu() starts. (On these parts of the puzzle, I defer to those more familiar with this code path. And I strongly suggest carefully commenting this type of action-at-a-distance design pattern.) Why is this important? Because otherwise that object might be freed before you get to the call to rcu_read_lock() at the end of this loop. And if that happens, list_for_each_entry_rcu() will be walking the freelist, which is quite bad for the health and well-being of your kernel. There are a few other ways to make this sort of thing work: 1. Defer the shrinker_put() to the beginning of the loop. You would need a flag initially set to zero, and then set to one just before (or just after) the rcu_read_lock() above. You would also need another shrinker_old pointer to track the old pointer. Then at the top of the loop, if the flag is set, invoke shrinker_put() on shrinker_old. This ensures that the previous shrinker structure stays around long enough to allow the loop to find the next shrinker structure in the list. This approach is attractive when the removal code path can invoke shrinker_put() after the grace period ends. 2. Make shrinker_put() invoke call_rcu() when ->refcount reaches zero, and have the callback function free the object. This of course requires adding an rcu_head structure to the shrinker structure, which might or might not be a reasonable course of action. If adding that rcu_head is reasonable, this simplifies the logic quite a bit. 3. For the shrinker-structure-removal code path, remove the shrinker structure, then remove the initial count from ->refcount, and then keep doing grace periods until ->refcount is zero, then do one more. Of course, if the result of removing the initial count was zero, then only a single additional grace period is required. This would need to be carefully commented, as it is a bit unconventional. There are probably many other ways, but just to give an idea of a few other ways to do this. > > > + > > > ret = do_shrink_slab(&sc, shrinker, priority); > > > if (ret == SHRINK_EMPTY) > > > ret = 0; > > > freed += ret; > > > - /* > > > - * Bail out if someone want to register a new shrinker to > > > - * prevent the registration from being stalled for long periods > > > - * by parallel ongoing shrinking. > > > - */ > > > - if (rwsem_is_contended(&shrinker_rwsem)) { > > > - freed = freed ? : 1; > > > - break; > > > - } > > > - } > > > > > > - up_read(&shrinker_rwsem); > > > -out: > > > + rcu_read_lock(); > > > > That new rcu_read_lock() won't help AFAIK, the whole > > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be > > safe. > > Yeah, that's the pattern we've been taught and the one we can look > at and immediately say "this is safe". > > This is a different pattern, as has been explained bi Qi, and I > think it *might* be safe. > > *However.* > > Right now I don't have time to go through a novel RCU list iteration > pattern it one step at to determine the correctness of the > algorithm. I'm mostly worried about list manipulations that can > occur outside rcu_read_lock() section bleeding into the RCU > critical section because rcu_read_lock() by itself is not a memory > barrier. > > Maybe Paul has seen this pattern often enough he could simply tell > us what conditions it is safe in. But for me to work that out from > first principles? I just don't have the time to do that right now. If the code does just the right sequence of things on the removal path (remove, decrement reference, wait for reference to go to zero, wait for grace period, free), then it would work. If this is what is happening, I would argue for more comments. ;-) Thanx, Paul > > IIUC this is why Dave in [4] suggests unifying shrink_slab() with > > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. > > Yes, I suggested the IDR route because radix tree lookups under RCU > with reference counted objects are a known safe pattern that we can > easily confirm is correct or not. Hence I suggested the unification > + IDR route because it makes the life of reviewers so, so much > easier... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On 2023/7/4 00:39, Paul E. McKenney wrote: > On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote: >> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >>> On 6/22/23 10:53, Qi Zheng wrote: >>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, >>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >>>> >>>> - if (!down_read_trylock(&shrinker_rwsem)) >>>> - goto out; >>>> - >>>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>>> + rcu_read_lock(); >>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>>> struct shrink_control sc = { >>>> .gfp_mask = gfp_mask, >>>> .nid = nid, >>>> .memcg = memcg, >>>> }; >>>> >>>> + if (!shrinker_try_get(shrinker)) >>>> + continue; >>>> + rcu_read_unlock(); >>> >>> I don't think you can do this unlock? > > Sorry to be slow to respond here, this one fell through the cracks. > And thank you to Qi for reminding me! > > If you do this unlock, you had jolly well better nail down the current > element (the one referenced by shrinker), for example, by acquiring an > explicit reference count on the object. And presumably this is exactly > what shrinker_try_get() is doing. And a look at your 24/29 confirms this, > at least assuming that shrinker->refcount is set to zero before the call > to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't > start until *after* shrinker_put() calls complete(). Plus, as always, > the object must be removed from the list before the synchronize_rcu() > starts. (On these parts of the puzzle, I defer to those more familiar > with this code path. And I strongly suggest carefully commenting this > type of action-at-a-distance design pattern.) Yeah, I think I've done it like above. A more detailed timing diagram is below. > > Why is this important? Because otherwise that object might be freed > before you get to the call to rcu_read_lock() at the end of this loop. > And if that happens, list_for_each_entry_rcu() will be walking the > freelist, which is quite bad for the health and well-being of your kernel. > > There are a few other ways to make this sort of thing work: > > 1. Defer the shrinker_put() to the beginning of the loop. > You would need a flag initially set to zero, and then set to > one just before (or just after) the rcu_read_lock() above. > You would also need another shrinker_old pointer to track the > old pointer. Then at the top of the loop, if the flag is set, > invoke shrinker_put() on shrinker_old. This ensures that the > previous shrinker structure stays around long enough to allow > the loop to find the next shrinker structure in the list. > > This approach is attractive when the removal code path > can invoke shrinker_put() after the grace period ends. > > 2. Make shrinker_put() invoke call_rcu() when ->refcount reaches > zero, and have the callback function free the object. This of > course requires adding an rcu_head structure to the shrinker > structure, which might or might not be a reasonable course of > action. If adding that rcu_head is reasonable, this simplifies > the logic quite a bit. > > 3. For the shrinker-structure-removal code path, remove the shrinker > structure, then remove the initial count from ->refcount, > and then keep doing grace periods until ->refcount is zero, > then do one more. Of course, if the result of removing the > initial count was zero, then only a single additional grace > period is required. > > This would need to be carefully commented, as it is a bit > unconventional. Thanks for such a detailed addition! > > There are probably many other ways, but just to give an idea of a few > other ways to do this. > >>>> + >>>> ret = do_shrink_slab(&sc, shrinker, priority); >>>> if (ret == SHRINK_EMPTY) >>>> ret = 0; >>>> freed += ret; >>>> - /* >>>> - * Bail out if someone want to register a new shrinker to >>>> - * prevent the registration from being stalled for long periods >>>> - * by parallel ongoing shrinking. >>>> - */ >>>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>>> - freed = freed ? : 1; >>>> - break; >>>> - } >>>> - } >>>> >>>> - up_read(&shrinker_rwsem); >>>> -out: >>>> + rcu_read_lock(); >>> >>> That new rcu_read_lock() won't help AFAIK, the whole >>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be >>> safe. >> >> Yeah, that's the pattern we've been taught and the one we can look >> at and immediately say "this is safe". >> >> This is a different pattern, as has been explained bi Qi, and I >> think it *might* be safe. >> >> *However.* >> >> Right now I don't have time to go through a novel RCU list iteration >> pattern it one step at to determine the correctness of the >> algorithm. I'm mostly worried about list manipulations that can >> occur outside rcu_read_lock() section bleeding into the RCU >> critical section because rcu_read_lock() by itself is not a memory >> barrier. >> >> Maybe Paul has seen this pattern often enough he could simply tell >> us what conditions it is safe in. But for me to work that out from >> first principles? I just don't have the time to do that right now. > > If the code does just the right sequence of things on the removal path > (remove, decrement reference, wait for reference to go to zero, wait for > grace period, free), then it would work. If this is what is happening, > I would argue for more comments. ;-) The order of the removal path is slightly different from this: shrink_slab unregister_shrinker =========== =================== shrinker_try_get() rcu_read_unlock() 1. decrement initial reference shrinker_put() 2. wait for reference to go to zero wait_for_completion() rcu_read_lock() shrinker_put() 3. remove the shrinker from list list_del_rcu() 4. wait for grace period kfree_rcu()/synchronize_rcu() list_for_each_entry() shrinker_try_get() rcu_read_unlock() 5. free the shrinker So the order is: decrement reference, wait for reference to go to zero, remove, wait for grace period, free. I think this can work. And we can only do the *step 3* after we hold the RCU read lock again, right? Please let me know if I missed something. Thanks, Qi > > Thanx, Paul > >>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with >>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR. >> >> Yes, I suggested the IDR route because radix tree lookups under RCU >> with reference counted objects are a known safe pattern that we can >> easily confirm is correct or not. Hence I suggested the unification >> + IDR route because it makes the life of reviewers so, so much >> easier... >> >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com
Hi Dave, On 2023/6/24 19:08, Qi Zheng wrote: > Hi Dave, > > On 2023/6/24 06:19, Dave Chinner wrote: >> On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote: >>> On 2023/6/23 14:29, Dave Chinner wrote: >>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >>>>> On 6/22/23 10:53, Qi Zheng wrote: >>>> Yes, I suggested the IDR route because radix tree lookups under RCU >>>> with reference counted objects are a known safe pattern that we can >>>> easily confirm is correct or not. Hence I suggested the unification >>>> + IDR route because it makes the life of reviewers so, so much >>>> easier... >>> >>> In fact, I originally planned to try the unification + IDR method you >>> suggested at the beginning. But in the case of CONFIG_MEMCG disabled, >>> the struct mem_cgroup is not even defined, and root_mem_cgroup and >>> shrinker_info will not be allocated. This required more code >>> changes, so >>> I ended up keeping the shrinker_list and implementing the above pattern. >> >> Yes. Go back and read what I originally said needed to be done >> first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs >> to exist that holds all of the global shrinkers. Then shrink_slab() >> is only ever passed a memcg that should be iterated. >> >> Yes, it needs changes external to the shrinker code itself to be >> made to work. And even if memcg's are not enabled, we can still use >> the memcg structures to ensure a common abstraction is used for the >> shrinker tracking infrastructure.... > > Yeah, what I imagined before was to define a more concise struct > mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root > memcg on system boot: > > #ifdef !CONFIG_MEMCG > > struct shrinker_info { > struct rcu_head rcu; > atomic_long_t *nr_deferred; > unsigned long *map; > int map_nr_max; > }; > > struct mem_cgroup_per_node { > struct shrinker_info __rcu *shrinker_info; > }; > > struct mem_cgroup { > struct mem_cgroup_per_node *nodeinfo[]; > }; > > #endif These days I tried doing this: 1. CONFIG_MEMCG && !mem_cgroup_disabled() track all global shrinkers with root_mem_cgroup. 2. CONFIG_MEMCG && mem_cgroup_disabled() the root_mem_cgroup is also allocated in this case, so still use root_mem_cgroup to track all global shrinkers. 3. !CONFIG_MEMCG allocate a dummy memcg during system startup (after cgroup_init()) and use it to track all global shrinkers This works, but needs to modify the startup order of some subsystems, because some shrinkers will be registered before root_mem_cgroup is allocated, such as: 1. rcu-kfree shrinker in rcu_init() 2. super block shrinkers in vfs_caches_init() And cgroup_init() also depends on some file system infrastructure, so I made some changes (rough and unorganized): diff --git a/fs/namespace.c b/fs/namespace.c index e157efc54023..6a12d3d0064e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4706,7 +4706,7 @@ static void __init init_mount_tree(void) void __init mnt_init(void) { - int err; + //int err; mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); @@ -4725,15 +4725,7 @@ void __init mnt_init(void) if (!mount_hashtable || !mountpoint_hashtable) panic("Failed to allocate mount hash table\n"); - kernfs_init(); - - err = sysfs_init(); - if (err) - printk(KERN_WARNING "%s: sysfs_init error: %d\n", - __func__, err); - fs_kobj = kobject_create_and_add("fs", NULL); - if (!fs_kobj) - printk(KERN_WARNING "%s: kobj create error\n", __func__); shmem_init(); init_rootfs(); init_mount_tree(); diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 7d9c2a63b7cd..d87c67f6f66e 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -119,6 +119,7 @@ static inline void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func) /* Internal to kernel */ void rcu_init(void); +void rcu_shrinker_init(void); extern int rcu_scheduler_active; void rcu_sched_clock_irq(int user); void rcu_report_dead(unsigned int cpu); diff --git a/init/main.c b/init/main.c index ad920fac325c..4190fc6d10ad 100644 --- a/init/main.c +++ b/init/main.c @@ -1049,14 +1049,22 @@ void start_kernel(void) security_init(); dbg_late_init(); net_ns_init(); + kernfs_init(); + if (sysfs_init()) + printk(KERN_WARNING "%s: sysfs_init error\n", + __func__); + fs_kobj = kobject_create_and_add("fs", NULL); + if (!fs_kobj) + printk(KERN_WARNING "%s: kobj create error\n", __func__); + proc_root_init(); + cgroup_init(); vfs_caches_init(); pagecache_init(); signals_init(); seq_file_init(); - proc_root_init(); nsfs_init(); cpuset_init(); - cgroup_init(); + rcu_shrinker_init(); taskstats_init_early(); delayacct_init(); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d068ce3567fc..71a04ae8defb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4953,7 +4953,10 @@ static void __init kfree_rcu_batch_init(void) INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func); krcp->initialized = true; } +} +void __init rcu_shrinker_init(void) +{ kfree_rcu_shrinker = shrinker_alloc(0, "rcu-kfree"); if (!kfree_rcu_shrinker) { pr_err("Failed to allocate kfree_rcu() shrinker!\n"); I adjusted it step by step according to the errors reported, and there may be hidden problems (needs more review and testing). In addition, unifying the processing of global and memcg slab shrink does have many benefits: 1. shrinker::nr_deferred can be removed 2. shrinker_list can be removed 3. simplifies the existing code logic and subsequent lockless processing But I'm still a bit apprehensive about modifying the boot order. :( What do you think about this? Thanks, Qi > > But I have a concern: if all global shrinkers are tracking with the > info->map of root memcg, a shrinker->id needs to be assigned to them, > which will cause info->map_nr_max to become larger than before, then > making the traversal of info->map slower. > >> >>> If the above pattern is not safe, I will go back to the unification + >>> IDR method. >> >> And that is exactly how we got into this mess in the first place.... > > I only found one similar pattern in the kernel: > > fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table > > But IIUC, the refcount here needs to be decremented after holding > rcu lock as I did above. > > So regardless of whether we choose unification + IDR in the end, I still > want to confirm whether the pattern I implemented above is safe. :) > > Thanks, > Qi > >> >> -Dave
On 2023/7/4 11:45, Qi Zheng wrote: > > > On 2023/7/4 00:39, Paul E. McKenney wrote: >> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote: >>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote: >>>> On 6/22/23 10:53, Qi Zheng wrote: >>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t >>>>> gfp_mask, int nid, >>>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) >>>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority); >>>>> - if (!down_read_trylock(&shrinker_rwsem)) >>>>> - goto out; >>>>> - >>>>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>>>> + rcu_read_lock(); >>>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>>>> struct shrink_control sc = { >>>>> .gfp_mask = gfp_mask, >>>>> .nid = nid, >>>>> .memcg = memcg, >>>>> }; >>>>> + if (!shrinker_try_get(shrinker)) >>>>> + continue; >>>>> + rcu_read_unlock(); >>>> >>>> I don't think you can do this unlock? >> >> Sorry to be slow to respond here, this one fell through the cracks. >> And thank you to Qi for reminding me! >> >> If you do this unlock, you had jolly well better nail down the current >> element (the one referenced by shrinker), for example, by acquiring an >> explicit reference count on the object. And presumably this is exactly >> what shrinker_try_get() is doing. And a look at your 24/29 confirms >> this, >> at least assuming that shrinker->refcount is set to zero before the call >> to synchronize_rcu() in free_module() *and* that synchronize_rcu() >> doesn't >> start until *after* shrinker_put() calls complete(). Plus, as always, >> the object must be removed from the list before the synchronize_rcu() >> starts. (On these parts of the puzzle, I defer to those more familiar >> with this code path. And I strongly suggest carefully commenting this >> type of action-at-a-distance design pattern.) > > Yeah, I think I've done it like above. A more detailed timing diagram is > below. > >> >> Why is this important? Because otherwise that object might be freed >> before you get to the call to rcu_read_lock() at the end of this loop. >> And if that happens, list_for_each_entry_rcu() will be walking the >> freelist, which is quite bad for the health and well-being of your >> kernel. >> >> There are a few other ways to make this sort of thing work: >> >> 1. Defer the shrinker_put() to the beginning of the loop. >> You would need a flag initially set to zero, and then set to >> one just before (or just after) the rcu_read_lock() above. >> You would also need another shrinker_old pointer to track the >> old pointer. Then at the top of the loop, if the flag is set, >> invoke shrinker_put() on shrinker_old. This ensures that the >> previous shrinker structure stays around long enough to allow >> the loop to find the next shrinker structure in the list. >> >> This approach is attractive when the removal code path >> can invoke shrinker_put() after the grace period ends. >> >> 2. Make shrinker_put() invoke call_rcu() when ->refcount reaches >> zero, and have the callback function free the object. This of >> course requires adding an rcu_head structure to the shrinker >> structure, which might or might not be a reasonable course of >> action. If adding that rcu_head is reasonable, this simplifies >> the logic quite a bit. >> >> 3. For the shrinker-structure-removal code path, remove the shrinker >> structure, then remove the initial count from ->refcount, >> and then keep doing grace periods until ->refcount is zero, >> then do one more. Of course, if the result of removing the >> initial count was zero, then only a single additional grace >> period is required. >> >> This would need to be carefully commented, as it is a bit >> unconventional. > > Thanks for such a detailed addition! > >> >> There are probably many other ways, but just to give an idea of a few >> other ways to do this. >> >>>>> + >>>>> ret = do_shrink_slab(&sc, shrinker, priority); >>>>> if (ret == SHRINK_EMPTY) >>>>> ret = 0; >>>>> freed += ret; >>>>> - /* >>>>> - * Bail out if someone want to register a new shrinker to >>>>> - * prevent the registration from being stalled for long >>>>> periods >>>>> - * by parallel ongoing shrinking. >>>>> - */ >>>>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>>>> - freed = freed ? : 1; >>>>> - break; >>>>> - } >>>>> - } >>>>> - up_read(&shrinker_rwsem); >>>>> -out: >>>>> + rcu_read_lock(); >>>> >>>> That new rcu_read_lock() won't help AFAIK, the whole >>>> list_for_each_entry_rcu() needs to be under the single >>>> rcu_read_lock() to be >>>> safe. >>> >>> Yeah, that's the pattern we've been taught and the one we can look >>> at and immediately say "this is safe". >>> >>> This is a different pattern, as has been explained bi Qi, and I >>> think it *might* be safe. >>> >>> *However.* >>> >>> Right now I don't have time to go through a novel RCU list iteration >>> pattern it one step at to determine the correctness of the >>> algorithm. I'm mostly worried about list manipulations that can >>> occur outside rcu_read_lock() section bleeding into the RCU >>> critical section because rcu_read_lock() by itself is not a memory >>> barrier. >>> >>> Maybe Paul has seen this pattern often enough he could simply tell >>> us what conditions it is safe in. But for me to work that out from >>> first principles? I just don't have the time to do that right now. >> >> If the code does just the right sequence of things on the removal path >> (remove, decrement reference, wait for reference to go to zero, wait for >> grace period, free), then it would work. If this is what is happening, >> I would argue for more comments. ;-) > > The order of the removal path is slightly different from this: > > shrink_slab unregister_shrinker > =========== =================== > > shrinker_try_get() > rcu_read_unlock() > 1. decrement initial reference > shrinker_put() > 2. wait for reference to go to zero > wait_for_completion() > rcu_read_lock() > > shrinker_put() > 3. remove the shrinker from list > list_del_rcu() > 4. wait for grace period > kfree_rcu()/synchronize_rcu() > > > list_for_each_entry() > > shrinker_try_get() > rcu_read_unlock() > 5. free the shrinker > > So the order is: decrement reference, wait for reference to go to zero, > remove, wait for grace period, free. > > I think this can work. And we can only do the *step 3* after we hold the > RCU read lock again, right? Please let me know if I missed something. Oh, you are right, It would be better to move step 3 to step 1. We should first remove the shrinker from the shrinker_list to prevent other traversers from finding it again, otherwise the following situations may occur theoretically: CPU 0 CPU 1 shrinker_try_get() shrinker_try_get() shrinker_put() shrinker_try_get() shrinker_put() Thanks, Qi > > Thanks, > Qi > >> >> Thanx, Paul >> >>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with >>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses >>>> IDR. >>> >>> Yes, I suggested the IDR route because radix tree lookups under RCU >>> with reference counted objects are a known safe pattern that we can >>> easily confirm is correct or not. Hence I suggested the unification >>> + IDR route because it makes the life of reviewers so, so much >>> easier... >>> >>> Cheers, >>> >>> Dave. >>> -- >>> Dave Chinner >>> david@fromorbit.com
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 7bfeb2f25246..b0c6c2df9db8 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -74,6 +74,7 @@ struct shrinker { refcount_t refcount; struct completion completion_wait; + struct rcu_head rcu; void *private_data; @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count, void shrinker_free(struct shrinker *shrinker); void unregister_and_free_shrinker(struct shrinker *shrinker); +static inline bool shrinker_try_get(struct shrinker *shrinker) +{ + return refcount_inc_not_zero(&shrinker->refcount); +} + static inline void shrinker_put(struct shrinker *shrinker) { if (refcount_dec_and_test(&shrinker->refcount)) diff --git a/mm/vmscan.c b/mm/vmscan.c index 6f9c4750effa..767569698946 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -57,6 +57,7 @@ #include <linux/khugepaged.h> #include <linux/rculist_nulls.h> #include <linux/random.h> +#include <linux/rculist.h> #include <asm/tlbflush.h> #include <asm/div64.h> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker) down_write(&shrinker_rwsem); refcount_set(&shrinker->refcount, 1); init_completion(&shrinker->completion_wait); - list_add_tail(&shrinker->list, &shrinker_list); + list_add_tail_rcu(&shrinker->list, &shrinker_list); shrinker->flags |= SHRINKER_REGISTERED; shrinker_debugfs_add(shrinker); up_write(&shrinker_rwsem); @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker) wait_for_completion(&shrinker->completion_wait); down_write(&shrinker_rwsem); - list_del(&shrinker->list); + list_del_rcu(&shrinker->list); shrinker->flags &= ~SHRINKER_REGISTERED; if (shrinker->flags & SHRINKER_MEMCG_AWARE) unregister_memcg_shrinker(shrinker); @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free); void unregister_and_free_shrinker(struct shrinker *shrinker) { unregister_shrinker(shrinker); - kfree(shrinker); + kfree_rcu(shrinker, rcu); } EXPORT_SYMBOL(unregister_and_free_shrinker); @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) return shrink_slab_memcg(gfp_mask, nid, memcg, priority); - if (!down_read_trylock(&shrinker_rwsem)) - goto out; - - list_for_each_entry(shrinker, &shrinker_list, list) { + rcu_read_lock(); + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, }; + if (!shrinker_try_get(shrinker)) + continue; + rcu_read_unlock(); + ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) ret = 0; freed += ret; - /* - * Bail out if someone want to register a new shrinker to - * prevent the registration from being stalled for long periods - * by parallel ongoing shrinking. - */ - if (rwsem_is_contended(&shrinker_rwsem)) { - freed = freed ? : 1; - break; - } - } - up_read(&shrinker_rwsem); -out: + rcu_read_lock(); + shrinker_put(shrinker); + } + rcu_read_unlock(); cond_resched(); return freed; }
The shrinker_rwsem is a global read-write lock in shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases. 1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() failed). Such as the real workload mentioned by Kirill Tkhai: ``` One of the real workloads from my experience is start of an overcommitted node containing many starting containers after node crash (or many resuming containers after reboot for kernel update). In these cases memory pressure is huge, and the node goes round in long reclaim. ``` 2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mount a fs), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked. Even if there is no competitor when shrinking slab, there may still be a problem. If we have a long shrinker list and we do not reclaim enough memory with each shrinker, then the down_read_trylock() may be called with high frequency. Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle). We used to implement the lockless slab shrink with SRCU [1], but then kernel test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3]. This commit uses the refcount+RCU method [4] proposed by by Dave Chinner to re-implement the lockless global slab shrink. The memcg slab shrink is handled in the subsequent patch. Currently, the shrinker instances can be divided into the following three types: a) global shrinker instance statically defined in the kernel, such as workingset_shadow_shrinker. b) global shrinker instance statically defined in the kernel modules, such as mmu_shrinker in x86. c) shrinker instance embedded in other structures. For case a, the memory of shrinker instance is never freed. For case b, the memory of shrinker instance will be freed after the module is unloaded. But we will call synchronize_rcu() in free_module() to wait for RCU read-side critical section to exit. For case c, the memory of shrinker instance will be dynamically freed by calling kfree_rcu(). So we can use rcu_read_{lock,unlock}() to ensure that the shrinker instance is valid. The shrinker::refcount mechanism ensures that the shrinker instance will not be run again after unregistration. So the structure that records the pointer of shrinker instance can be safely freed without waiting for the RCU read-side critical section. In this way, while we implement the lockless slab shrink, we don't need to be blocked in unregister_shrinker() to wait RCU read-side critical section. The following are the test results: stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & 1) Before applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 880623 60.02 7.71 226.93 14671.45 3753.09 ramfs: 1 System Management Interrupt for a 60.03s run time: 5762.40s available CPU time 7.71s user time ( 0.13%) 226.93s system time ( 3.94%) 234.64s total time ( 4.07%) load average: 8.54 3.06 2.11 passed: 9: ramfs (9) failed: 0 skipped: 0 successful run completed in 60.03s (1 min, 0.03 secs) 2) After applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 847562 60.02 7.44 230.22 14120.66 3566.23 ramfs: 4 System Management Interrupts for a 60.12s run time: 5771.95s available CPU time 7.44s user time ( 0.13%) 230.22s system time ( 3.99%) 237.66s total time ( 4.12%) load average: 8.18 2.43 0.84 passed: 9: ramfs (9) failed: 0 skipped: 0 successful run completed in 60.12s (1 min, 0.12 secs) We can see that the ops/s has hardly changed. [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/shrinker.h | 6 ++++++ mm/vmscan.c | 33 ++++++++++++++------------------- 2 files changed, 20 insertions(+), 19 deletions(-)