Message ID | 20191129214541.3110-1-ptikhomirov@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: fix hanging shrinker management on long do_shrink_slab | expand |
On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > We have a problem that shrinker_rwsem can be held for a long time for > read in shrink_slab, at the same time any process which is trying to > manage shrinkers hangs. > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > It tries to shrink something on nfs (hard) but nfs server is dead at > these moment already and rpc will never succeed. Generally any shrinker > can take significant time to do_shrink_slab, so it's a bad idea to hold > the list lock here. > > We have a similar problem in shrink_slab_memcg, except that we are > traversing shrinker_map+shrinker_idr there. > > The idea of the patch is to inc a refcount to the chosen shrinker so it > won't disappear and release shrinker_rwsem while we are in > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > the refcount and continue the traversal. > > We also need a wait_queue so that unregister_shrinker can wait for the > refcnt to become zero. Only after these we can safely remove the > shrinker from list and idr, and free the shrinker. > > I've reproduced the nfs hang in do_shrink_slab with the patch applied on > ms kernel, all other mount/unmount pass fine without any hang. > > Here is a reproduction on kernel without patch: > > 1) Setup nfs on server node with some files in it (e.g. 200) > > [server]# cat /etc/exports > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) > > 2) Hard mount it on client node > > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt > > 3) Open some (e.g. 200) files on the mount > > [client]# for i in $(find /mnt/ -type f | head -n 200); \ > do setsid sleep 1000 &>/dev/null <$i & done > > 4) Kill all openers > > [client]# killall sleep -9 > > 5) Put your network cable out on client node > > 6) Drop caches on the client, it will hang on nfs while holding > shrinker_rwsem lock for read > > [client]# echo 3 > /proc/sys/vm/drop_caches > > crash> bt ... > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > #3 [...] __wait_on_bit at ... > #4 [...] out_of_line_wait_on_bit at ... > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > #7 [...] nfs_do_return_delegation at ... [nfsv4] > #8 [...] nfs4_evict_inode at ... [nfsv4] > #9 [...] evict at ... > #10 [...] dispose_list at ... > #11 [...] prune_icache_sb at ... > #12 [...] super_cache_scan at ... > #13 [...] do_shrink_slab at ... > #14 [...] shrink_slab at ... > #15 [...] drop_slab_node at ... > #16 [...] drop_slab at ... > #17 [...] drop_caches_sysctl_handler at ... > #18 [...] proc_sys_call_handler at ... > #19 [...] vfs_write at ... > #20 [...] ksys_write at ... > #21 [...] do_syscall_64 at ... > #22 [...] entry_SYSCALL_64_after_hwframe at ... > > 7) All other mount/umount activity now hangs with no luck to take > shrinker_rwsem for write. > > [client]# mount -t tmpfs tmpfs /tmp > > crash> bt ... > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rwsem_down_write_slowpath at ... > #3 [...] prealloc_shrinker at ... > #4 [...] alloc_super at ... > #5 [...] sget at ... > #6 [...] mount_nodev at ... > #7 [...] legacy_get_tree at ... > #8 [...] vfs_get_tree at ... > #9 [...] do_mount at ... > #10 [...] ksys_mount at ... > #11 [...] __x64_sys_mount at ... > #12 [...] do_syscall_64 at ... > #13 [...] entry_SYSCALL_64_after_hwframe at ... > I don't think this patch solves the problem, it only fixes one minor symptom of it. The actual problem here the reclaim hang in the nfs. It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. Even mount() itself has GFP_KERNEL allocations in its path, so it still might stuck there even with your patch. I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous.
On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > > > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > > We have a problem that shrinker_rwsem can be held for a long time for > > read in shrink_slab, at the same time any process which is trying to > > manage shrinkers hangs. > > > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > > It tries to shrink something on nfs (hard) but nfs server is dead at > > these moment already and rpc will never succeed. Generally any shrinker > > can take significant time to do_shrink_slab, so it's a bad idea to hold > > the list lock here. > > > > We have a similar problem in shrink_slab_memcg, except that we are > > traversing shrinker_map+shrinker_idr there. > > > > The idea of the patch is to inc a refcount to the chosen shrinker so it > > won't disappear and release shrinker_rwsem while we are in > > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > > the refcount and continue the traversal. > > > > We also need a wait_queue so that unregister_shrinker can wait for the > > refcnt to become zero. Only after these we can safely remove the > > shrinker from list and idr, and free the shrinker. > > > > I've reproduced the nfs hang in do_shrink_slab with the patch applied on > > ms kernel, all other mount/unmount pass fine without any hang. > > > > Here is a reproduction on kernel without patch: > > > > 1) Setup nfs on server node with some files in it (e.g. 200) > > > > [server]# cat /etc/exports > > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) > > > > 2) Hard mount it on client node > > > > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt > > > > 3) Open some (e.g. 200) files on the mount > > > > [client]# for i in $(find /mnt/ -type f | head -n 200); \ > > do setsid sleep 1000 &>/dev/null <$i & done > > > > 4) Kill all openers > > > > [client]# killall sleep -9 > > > > 5) Put your network cable out on client node > > > > 6) Drop caches on the client, it will hang on nfs while holding > > shrinker_rwsem lock for read > > > > [client]# echo 3 > /proc/sys/vm/drop_caches > > > > crash> bt ... > > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > > #0 [...] __schedule at ... > > #1 [...] schedule at ... > > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > > #3 [...] __wait_on_bit at ... > > #4 [...] out_of_line_wait_on_bit at ... > > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > > #7 [...] nfs_do_return_delegation at ... [nfsv4] > > #8 [...] nfs4_evict_inode at ... [nfsv4] > > #9 [...] evict at ... > > #10 [...] dispose_list at ... > > #11 [...] prune_icache_sb at ... > > #12 [...] super_cache_scan at ... > > #13 [...] do_shrink_slab at ... > > #14 [...] shrink_slab at ... > > #15 [...] drop_slab_node at ... > > #16 [...] drop_slab at ... > > #17 [...] drop_caches_sysctl_handler at ... > > #18 [...] proc_sys_call_handler at ... > > #19 [...] vfs_write at ... > > #20 [...] ksys_write at ... > > #21 [...] do_syscall_64 at ... > > #22 [...] entry_SYSCALL_64_after_hwframe at ... > > > > 7) All other mount/umount activity now hangs with no luck to take > > shrinker_rwsem for write. > > > > [client]# mount -t tmpfs tmpfs /tmp > > > > crash> bt ... > > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" > > #0 [...] __schedule at ... > > #1 [...] schedule at ... > > #2 [...] rwsem_down_write_slowpath at ... > > #3 [...] prealloc_shrinker at ... > > #4 [...] alloc_super at ... > > #5 [...] sget at ... > > #6 [...] mount_nodev at ... > > #7 [...] legacy_get_tree at ... > > #8 [...] vfs_get_tree at ... > > #9 [...] do_mount at ... > > #10 [...] ksys_mount at ... > > #11 [...] __x64_sys_mount at ... > > #12 [...] do_syscall_64 at ... > > #13 [...] entry_SYSCALL_64_after_hwframe at ... > > > > > I don't think this patch solves the problem, it only fixes one minor symptom of it. > The actual problem here the reclaim hang in the nfs. > It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. > > Even mount() itself has GFP_KERNEL allocations in its path, so it still might stuck there even with your patch. > > I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. Though I agree that we should be fixing shrinkers to not get stuck (and be more async), I still think the problem this patch is solving is worth fixing. On machines running multiple workloads, one job stuck in slab shrinker and blocking all other unrelated jobs wanting shrinker_rwsem, breaks the isolation and causes DoS. Shakeel
On 03.12.2019 03:13, Shakeel Butt wrote: > On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: >> >> >> On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: >>> We have a problem that shrinker_rwsem can be held for a long time for >>> read in shrink_slab, at the same time any process which is trying to >>> manage shrinkers hangs. >>> >>> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. >>> It tries to shrink something on nfs (hard) but nfs server is dead at >>> these moment already and rpc will never succeed. Generally any shrinker >>> can take significant time to do_shrink_slab, so it's a bad idea to hold >>> the list lock here. >>> >>> We have a similar problem in shrink_slab_memcg, except that we are >>> traversing shrinker_map+shrinker_idr there. >>> >>> The idea of the patch is to inc a refcount to the chosen shrinker so it >>> won't disappear and release shrinker_rwsem while we are in >>> do_shrink_slab, after that we will reacquire shrinker_rwsem, dec >>> the refcount and continue the traversal. >>> >>> We also need a wait_queue so that unregister_shrinker can wait for the >>> refcnt to become zero. Only after these we can safely remove the >>> shrinker from list and idr, and free the shrinker. >>> >>> I've reproduced the nfs hang in do_shrink_slab with the patch applied on >>> ms kernel, all other mount/unmount pass fine without any hang. >>> >>> Here is a reproduction on kernel without patch: >>> >>> 1) Setup nfs on server node with some files in it (e.g. 200) >>> >>> [server]# cat /etc/exports >>> /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) >>> >>> 2) Hard mount it on client node >>> >>> [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt >>> >>> 3) Open some (e.g. 200) files on the mount >>> >>> [client]# for i in $(find /mnt/ -type f | head -n 200); \ >>> do setsid sleep 1000 &>/dev/null <$i & done >>> >>> 4) Kill all openers >>> >>> [client]# killall sleep -9 >>> >>> 5) Put your network cable out on client node >>> >>> 6) Drop caches on the client, it will hang on nfs while holding >>> shrinker_rwsem lock for read >>> >>> [client]# echo 3 > /proc/sys/vm/drop_caches >>> >>> crash> bt ... >>> PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" >>> #0 [...] __schedule at ... >>> #1 [...] schedule at ... >>> #2 [...] rpc_wait_bit_killable at ... [sunrpc] >>> #3 [...] __wait_on_bit at ... >>> #4 [...] out_of_line_wait_on_bit at ... >>> #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] >>> #6 [...] nfs4_proc_delegreturn at ... [nfsv4] >>> #7 [...] nfs_do_return_delegation at ... [nfsv4] >>> #8 [...] nfs4_evict_inode at ... [nfsv4] >>> #9 [...] evict at ... >>> #10 [...] dispose_list at ... >>> #11 [...] prune_icache_sb at ... >>> #12 [...] super_cache_scan at ... >>> #13 [...] do_shrink_slab at ... >>> #14 [...] shrink_slab at ... >>> #15 [...] drop_slab_node at ... >>> #16 [...] drop_slab at ... >>> #17 [...] drop_caches_sysctl_handler at ... >>> #18 [...] proc_sys_call_handler at ... >>> #19 [...] vfs_write at ... >>> #20 [...] ksys_write at ... >>> #21 [...] do_syscall_64 at ... >>> #22 [...] entry_SYSCALL_64_after_hwframe at ... >>> >>> 7) All other mount/umount activity now hangs with no luck to take >>> shrinker_rwsem for write. >>> >>> [client]# mount -t tmpfs tmpfs /tmp >>> >>> crash> bt ... >>> PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" >>> #0 [...] __schedule at ... >>> #1 [...] schedule at ... >>> #2 [...] rwsem_down_write_slowpath at ... >>> #3 [...] prealloc_shrinker at ... >>> #4 [...] alloc_super at ... >>> #5 [...] sget at ... >>> #6 [...] mount_nodev at ... >>> #7 [...] legacy_get_tree at ... >>> #8 [...] vfs_get_tree at ... >>> #9 [...] do_mount at ... >>> #10 [...] ksys_mount at ... >>> #11 [...] __x64_sys_mount at ... >>> #12 [...] do_syscall_64 at ... >>> #13 [...] entry_SYSCALL_64_after_hwframe at ... >>> >> >> >> I don't think this patch solves the problem, it only fixes one minor symptom of it. >> The actual problem here the reclaim hang in the nfs. >> It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. >> >> Even mount() itself has GFP_KERNEL allocations in its path, so it still might stuck there even with your patch. >> >> I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. > > Though I agree that we should be fixing shrinkers to not get stuck > (and be more async), I still think the problem this patch is solving > is worth fixing. On machines running multiple workloads, one job stuck > in slab shrinker and blocking all other unrelated jobs wanting > shrinker_rwsem, breaks the isolation and causes DoS. This makes jobs isolated till global reclaim occurs. In that case any process (including kswapd) become stuck, and nothing will help. But (as it was spoken in this list some time ago) many people configure their workload in the way they don't have global reclaim, and they may get some profit from this.
On 30.11.2019 00:45, Pavel Tikhomirov wrote: > We have a problem that shrinker_rwsem can be held for a long time for > read in shrink_slab, at the same time any process which is trying to > manage shrinkers hangs. > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > It tries to shrink something on nfs (hard) but nfs server is dead at > these moment already and rpc will never succeed. Generally any shrinker > can take significant time to do_shrink_slab, so it's a bad idea to hold > the list lock here. > > We have a similar problem in shrink_slab_memcg, except that we are > traversing shrinker_map+shrinker_idr there. > > The idea of the patch is to inc a refcount to the chosen shrinker so it > won't disappear and release shrinker_rwsem while we are in > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > the refcount and continue the traversal. > > We also need a wait_queue so that unregister_shrinker can wait for the > refcnt to become zero. Only after these we can safely remove the > shrinker from list and idr, and free the shrinker. > > I've reproduced the nfs hang in do_shrink_slab with the patch applied on > ms kernel, all other mount/unmount pass fine without any hang. > > Here is a reproduction on kernel without patch: > > 1) Setup nfs on server node with some files in it (e.g. 200) > > [server]# cat /etc/exports > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) > > 2) Hard mount it on client node > > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt > > 3) Open some (e.g. 200) files on the mount > > [client]# for i in $(find /mnt/ -type f | head -n 200); \ > do setsid sleep 1000 &>/dev/null <$i & done > > 4) Kill all openers > > [client]# killall sleep -9 > > 5) Put your network cable out on client node > > 6) Drop caches on the client, it will hang on nfs while holding > shrinker_rwsem lock for read > > [client]# echo 3 > /proc/sys/vm/drop_caches > > crash> bt ... > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > #3 [...] __wait_on_bit at ... > #4 [...] out_of_line_wait_on_bit at ... > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > #7 [...] nfs_do_return_delegation at ... [nfsv4] > #8 [...] nfs4_evict_inode at ... [nfsv4] > #9 [...] evict at ... > #10 [...] dispose_list at ... > #11 [...] prune_icache_sb at ... > #12 [...] super_cache_scan at ... > #13 [...] do_shrink_slab at ... > #14 [...] shrink_slab at ... > #15 [...] drop_slab_node at ... > #16 [...] drop_slab at ... > #17 [...] drop_caches_sysctl_handler at ... > #18 [...] proc_sys_call_handler at ... > #19 [...] vfs_write at ... > #20 [...] ksys_write at ... > #21 [...] do_syscall_64 at ... > #22 [...] entry_SYSCALL_64_after_hwframe at ... > > 7) All other mount/umount activity now hangs with no luck to take > shrinker_rwsem for write. > > [client]# mount -t tmpfs tmpfs /tmp > > crash> bt ... > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rwsem_down_write_slowpath at ... > #3 [...] prealloc_shrinker at ... > #4 [...] alloc_super at ... > #5 [...] sget at ... > #6 [...] mount_nodev at ... > #7 [...] legacy_get_tree at ... > #8 [...] vfs_get_tree at ... > #9 [...] do_mount at ... > #10 [...] ksys_mount at ... > #11 [...] __x64_sys_mount at ... > #12 [...] do_syscall_64 at ... > #13 [...] entry_SYSCALL_64_after_hwframe at ... > > That is on almost clean and almost mainstream Fedora kernel: > > [client]# uname -a > Linux snorch 5.3.8-200.snorch.fc30.x86_64 #1 SMP Mon Nov 11 16:01:15 MSK > 2019 x86_64 x86_64 x86_64 GNU/Linux > > Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> The approach looks OK for me, and despite it does not solve the problem completely, it improves the isolation. Also, we may use this to finally make shrink_slab() iterations lockless (or you may just send couple of patches on top of that :). Small cosmetic note. I'd removed comments in prealloc_shrinker() and free_preallocated_shrinker() since they just describe obvious actions, and we do not need to comment every line we do. But a small comment is needed about that synchronize_rcu() is to make wake_up on stable memory. > --- > include/linux/memcontrol.h | 6 +++ > include/linux/shrinker.h | 6 +++ > mm/memcontrol.c | 16 ++++++ > mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++- > 4 files changed, 134 insertions(+), 1 deletion(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index ae703ea3ef48..3717b94b6aa5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1348,6 +1348,8 @@ extern int memcg_expand_shrinker_maps(int new_id); > > extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, > int nid, int shrinker_id); > +extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, > + int nid, int shrinker_id); > #else > #define mem_cgroup_sockets_enabled 0 > static inline void mem_cgroup_sk_alloc(struct sock *sk) { }; > @@ -1361,6 +1363,10 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, > int nid, int shrinker_id) > { > } > +static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, > + int nid, int shrinker_id) > +{ > +} > #endif > > struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 0f80123650e2..dd3bb43ed58d 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -2,6 +2,9 @@ > #ifndef _LINUX_SHRINKER_H > #define _LINUX_SHRINKER_H > > +#include <linux/refcount.h> > +#include <linux/wait.h> > + > /* > * This struct is used to pass information from page reclaim to the shrinkers. > * We consolidate the values for easier extention later. > @@ -75,6 +78,9 @@ struct shrinker { > #endif > /* objs pending delete, per node */ > atomic_long_t *nr_deferred; > + > + refcount_t refcnt; > + wait_queue_head_t wq; > }; > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */ > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 01f3f8b665e9..81f45124feb7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -442,6 +442,22 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) > } > } > > +void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, > + int nid, int shrinker_id) > +{ > + struct memcg_shrinker_map *map; > + > + /* > + * The map for refcounted memcg can only be freed in > + * memcg_free_shrinker_map_rcu so we can safely protect > + * map with rcu_read_lock. > + */ > + rcu_read_lock(); > + map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map); > + clear_bit(shrinker_id, map->map); > + rcu_read_unlock(); > +} > + > /** > * mem_cgroup_css_from_page - css of the memcg associated with a page > * @page: page of interest > diff --git a/mm/vmscan.c b/mm/vmscan.c > index ee4eecc7e1c2..59e46d65e902 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -393,6 +393,13 @@ int prealloc_shrinker(struct shrinker *shrinker) > if (!shrinker->nr_deferred) > return -ENOMEM; > > + /* > + * Shrinker is not yet visible through shrinker_idr or shrinker_list, > + * so no locks required for initialization. > + */ > + refcount_set(&shrinker->refcnt, 1); > + init_waitqueue_head(&shrinker->wq); > + > if (shrinker->flags & SHRINKER_MEMCG_AWARE) { > if (prealloc_memcg_shrinker(shrinker)) > goto free_deferred; > @@ -411,6 +418,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker) > if (!shrinker->nr_deferred) > return; > > + /* The shrinker shouldn't be used at these point. */ > + WARN_ON(!refcount_dec_and_test(&shrinker->refcnt)); > + > if (shrinker->flags & SHRINKER_MEMCG_AWARE) > unregister_memcg_shrinker(shrinker); > > @@ -447,6 +457,15 @@ void unregister_shrinker(struct shrinker *shrinker) > { > if (!shrinker->nr_deferred) > return; > + > + /* > + * If refcnt is not zero we need to wait these shrinker to finish all > + * it's do_shrink_slab() calls. > + */ > + if (!refcount_dec_and_test(&shrinker->refcnt)) > + wait_event(shrinker->wq, > + (refcount_read(&shrinker->refcnt) == 0)); > + > if (shrinker->flags & SHRINKER_MEMCG_AWARE) > unregister_memcg_shrinker(shrinker); > down_write(&shrinker_rwsem); > @@ -454,6 +473,9 @@ void unregister_shrinker(struct shrinker *shrinker) > up_write(&shrinker_rwsem); > kfree(shrinker->nr_deferred); > shrinker->nr_deferred = NULL; > + > + /* Pairs with rcu_read_lock in put_shrinker() */ > + synchronize_rcu(); > } > EXPORT_SYMBOL(unregister_shrinker); > > @@ -589,6 +611,42 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > return freed; > } > > +struct shrinker *get_shrinker(struct shrinker *shrinker) > +{ > + /* > + * Pairs with refcnt dec in unregister_shrinker(), if refcnt is zero > + * these shrinker is already in the middle of unregister_shrinker() and > + * we can't use it. > + */ > + if (!refcount_inc_not_zero(&shrinker->refcnt)) > + shrinker = NULL; > + return shrinker; > +} > + > +void put_shrinker(struct shrinker *shrinker) > +{ > + /* > + * The rcu_read_lock pairs with synchronize_rcu() in > + * unregister_shrinker(), so that the shrinker is not freed > + * before the wake_up. > + */ > + rcu_read_lock(); > + if (!refcount_dec_and_test(&shrinker->refcnt)) { > + /* > + * Pairs with smp_mb in > + * wait_event()->prepare_to_wait() > + */ > + smp_mb(); > + /* > + * If refcnt becomes zero, we already have an > + * unregister_shrinker() waiting for us to finish. > + */ > + if (waitqueue_active(&shrinker->wq)) > + wake_up(&shrinker->wq); > + } > + rcu_read_unlock(); > +} > + > #ifdef CONFIG_MEMCG > static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, int priority) > @@ -628,9 +686,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > !(shrinker->flags & SHRINKER_NONSLAB)) > continue; > > + /* > + * Take a refcnt on a shrinker so that it can't be freed or > + * removed from shrinker_idr (and shrinker_list). These way we > + * can safely release shrinker_rwsem. > + * > + * We need to release shrinker_rwsem here as do_shrink_slab can > + * take too much time to finish (e.g. on nfs). And holding > + * global shrinker_rwsem can block registring and unregistring > + * of shrinkers. > + */ > + if (!get_shrinker(shrinker)) > + continue; > + up_read(&shrinker_rwsem); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) { > - clear_bit(i, map->map); > + memcg_clear_shrinker_bit(memcg, nid, i); > /* > * After the shrinker reported that it had no objects to > * free, but before we cleared the corresponding bit in > @@ -655,6 +727,22 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > } > freed += ret; > > + /* > + * Re-aquire shrinker_rwsem, put refcount and reload > + * shrinker_map as it can be modified in > + * memcg_expand_one_shrinker_map if new shrinkers > + * were registred in the meanwhile. > + */ > + if (!down_read_trylock(&shrinker_rwsem)) { > + freed = freed ? : 1; > + put_shrinker(shrinker); > + return freed; > + } > + put_shrinker(shrinker); > + map = rcu_dereference_protected( > + memcg->nodeinfo[nid]->shrinker_map, > + true); > + > if (rwsem_is_contended(&shrinker_rwsem)) { > freed = freed ? : 1; > break; > @@ -719,10 +807,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > .memcg = memcg, > }; > > + /* See comment in shrink_slab_memcg. */ > + if (!get_shrinker(shrinker)) > + continue; > + up_read(&shrinker_rwsem); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > + > + /* > + * We can safely continue traverse of the shrinker_list as > + * our shrinker is still on the list due to refcount. > + */ > + if (!down_read_trylock(&shrinker_rwsem)) { > + freed = freed ? : 1; > + put_shrinker(shrinker); > + goto out; > + } > + put_shrinker(shrinker); > + > /* > * Bail out if someone want to register a new shrinker to > * prevent the regsitration from being stalled for long periods >
On Sat 30-11-19 00:45:41, Pavel Tikhomirov wrote: > We have a problem that shrinker_rwsem can be held for a long time for > read in shrink_slab, at the same time any process which is trying to > manage shrinkers hangs. > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > It tries to shrink something on nfs (hard) but nfs server is dead at > these moment already and rpc will never succeed. Generally any shrinker > can take significant time to do_shrink_slab, so it's a bad idea to hold > the list lock here. Yes, this is a known problem and people have already tried to address it in the past. Have you checked previous attempts? SRCU based one http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain but I believe there were others (I only had this one in my notes). Please make sure to Cc Dave Chinner when posting a next version because he had some concerns about the change of the behavior. > We have a similar problem in shrink_slab_memcg, except that we are > traversing shrinker_map+shrinker_idr there. > > The idea of the patch is to inc a refcount to the chosen shrinker so it > won't disappear and release shrinker_rwsem while we are in > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > the refcount and continue the traversal. The reference count part makes sense to me. RCU role needs a better explanation. Also do you have any reason to not use completion for the final step? Openconding essentially the same concept sounds a bit awkward to me. > We also need a wait_queue so that unregister_shrinker can wait for the > refcnt to become zero. Only after these we can safely remove the > shrinker from list and idr, and free the shrinker. [...] > crash> bt ... > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > #3 [...] __wait_on_bit at ... > #4 [...] out_of_line_wait_on_bit at ... > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > #7 [...] nfs_do_return_delegation at ... [nfsv4] > #8 [...] nfs4_evict_inode at ... [nfsv4] > #9 [...] evict at ... > #10 [...] dispose_list at ... > #11 [...] prune_icache_sb at ... > #12 [...] super_cache_scan at ... > #13 [...] do_shrink_slab at ... Are NFS people aware of this? Because this is simply not acceptable behavior. Memory reclaim cannot be block indefinitely or for a long time. There must be a way to simply give up if the underlying inode cannot be reclaimed. I still have to think about the proposed solution. It sounds a bit over complicated to me.
> I don't think this patch solves the problem, it only fixes one minor symptom of it. > The actual problem here the reclaim hang in the nfs. > It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. > > Even mount() itself has GFP_KERNEL allocations in its path, so it still might stuck there even with your patch. > > I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. I see this bigger problem here, but fixing every shrinker (who knows what else except nfs can hang here) looks like a too big work. My patch allows us to separate bad shrinker from others at least before global reclaim. Best regards, Tikhomirov Pavel.
> The approach looks OK for me, and despite it does not solve the problem > completely, it improves the isolation. Also, we may use this to finally > make shrink_slab() iterations lockless (or you may just send couple of > patches on top of that :). If you mean replacing rwsem with srcu it looks like a good idea, but it can be done later separately on top of it (if we succeed with this patch =). > > Small cosmetic note. I'd removed comments in prealloc_shrinker() and > free_preallocated_shrinker() since they just describe obvious actions, > and we do not need to comment every line we do. But a small comment > is needed about that synchronize_rcu() is to make wake_up on stable > memory. I've just tried hard to make my patch as self explaining as possible, I can remove these (or some more) comments In next versions if they are excess. Best regards, Tikhomirov Pavel.
[please cc me on future shrinker infrastructure modifications] On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote: > > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > > We have a problem that shrinker_rwsem can be held for a long time for > > read in shrink_slab, at the same time any process which is trying to > > manage shrinkers hangs. > > > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > > It tries to shrink something on nfs (hard) but nfs server is dead at > > these moment already and rpc will never succeed. Generally any shrinker > > can take significant time to do_shrink_slab, so it's a bad idea to hold > > the list lock here. registering/unregistering a shrinker is not a performance critical task. If a shrinker is blocking for a long time, then we need to work to fix the shrinker implementation because blocking is a much bigger problem than just register/unregister. > > The idea of the patch is to inc a refcount to the chosen shrinker so it > > won't disappear and release shrinker_rwsem while we are in > > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > > the refcount and continue the traversal. This is going to cause a *lot* of traffic on the shrinker rwsem. It's already a pretty hot lock on large machines under memory pressure (think thousands of tasks all doing direct reclaim across hundreds of CPUs), and so changing them to cycle the rwsem on every shrinker that will only make this worse. Esepcially when we consider that there may be hundreds to thousands of registered shrinker instances on large machines. As an example of how frequent cycling of a global lock in shrinker instances causes issues, we used to take references to superblock shrinker count invocations to guarantee existence. This was found to be a scalability limitation when lots of near-empty superblocks were present in a system (see commit d23da150a37c ("fs/superblock: avoid locking counting inodes and dentries before reclaiming them")). This alleviated the problem for a while, but soon we had problems with just taking a reference to the superblock in the callbacks that did actual work. Hence we changed it to just take a per-superblock rwsem to get rid of the global sb_lock spinlock in this path. See commit eb6ef3df4faa ("trylock_super(): replacement for grab_super_passive()". Now we don't have a scalability problem. IOWs, we already know that cycling a global rwsem on every individual shrinker invocation is going to cause noticable scalability problems. Hence I don't think that this sort of "cycle the global rwsem faster to reduce [un]register latency" solution is going to fly because of the runtime performance regressions it will introduce.... > I don't think this patch solves the problem, it only fixes one minor symptom of it. > The actual problem here the reclaim hang in the nfs. The nfs client is waiting on the NFS server to respond. It may actually be that the server has hung, not the client... > It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. *nod* > I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. That's what we are trying to do with similar blocking based issues in XFS inode reclaim. It's not simple, though, because these days memory reclaim is like a bowl full of spaghetti covered with a delicious sauce of non-obvious heuristics and broken functionality.... Cheers, Dave.
On Fri 06-12-19 13:09:53, Dave Chinner wrote: > [please cc me on future shrinker infrastructure modifications] > > On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote: > > > > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > > > We have a problem that shrinker_rwsem can be held for a long time for > > > read in shrink_slab, at the same time any process which is trying to > > > manage shrinkers hangs. > > > > > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > > > It tries to shrink something on nfs (hard) but nfs server is dead at > > > these moment already and rpc will never succeed. Generally any shrinker > > > can take significant time to do_shrink_slab, so it's a bad idea to hold > > > the list lock here. > > registering/unregistering a shrinker is not a performance critical > task. We have had a bug report from production system which stumbled over a long [u]nmount which was stuck on a shrinker {un}register path waiting for the lock. This wouldn't be a big deal most of the time but [u]mount were part of the login session in this case. This was on an older distribution kernel and e496612c5130 ("mm: do not stall register_shrinker()") helped a bit but not really for shrinkers that take a lot of time. > If a shrinker is blocking for a long time, then we need to > work to fix the shrinker implementation because blocking is a much > bigger problem than just register/unregister. Absolutely agreed here.
On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@fromorbit.com> wrote: > > [please cc me on future shrinker infrastructure modifications] > > On Mon, Dec 02, 2019 at 07:36:03PM +0300, Andrey Ryabinin wrote: > > > > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > > > We have a problem that shrinker_rwsem can be held for a long time for > > > read in shrink_slab, at the same time any process which is trying to > > > manage shrinkers hangs. > > > > > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > > > It tries to shrink something on nfs (hard) but nfs server is dead at > > > these moment already and rpc will never succeed. Generally any shrinker > > > can take significant time to do_shrink_slab, so it's a bad idea to hold > > > the list lock here. > > registering/unregistering a shrinker is not a performance critical > task. Yes, not performance critical but it can cause isolation issues. > If a shrinker is blocking for a long time, then we need to > work to fix the shrinker implementation because blocking is a much > bigger problem than just register/unregister. > Yes, we should be fixing the implementations of all shrinkers and yes it is bigger issue but we can also fix register/unregister isolation issue in parallel. Fixing all shrinkers would a tedious and long task and we should not block fixing isolation issue on it. > > > The idea of the patch is to inc a refcount to the chosen shrinker so it > > > won't disappear and release shrinker_rwsem while we are in > > > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > > > the refcount and continue the traversal. > > This is going to cause a *lot* of traffic on the shrinker rwsem. > It's already a pretty hot lock on large machines under memory > pressure (think thousands of tasks all doing direct reclaim across > hundreds of CPUs), and so changing them to cycle the rwsem on every > shrinker that will only make this worse. Esepcially when we consider > that there may be hundreds to thousands of registered shrinker > instances on large machines. > > As an example of how frequent cycling of a global lock in shrinker > instances causes issues, we used to take references to superblock > shrinker count invocations to guarantee existence. This was found to > be a scalability limitation when lots of near-empty superblocks were > present in a system (see commit d23da150a37c ("fs/superblock: avoid > locking counting inodes and dentries before reclaiming them")). > > This alleviated the problem for a while, but soon we had problems > with just taking a reference to the superblock in the callbacks that > did actual work. Hence we changed it to just take a per-superblock > rwsem to get rid of the global sb_lock spinlock in this path. See > commit eb6ef3df4faa ("trylock_super(): replacement for > grab_super_passive()". Now we don't have a scalability problem. > > IOWs, we already know that cycling a global rwsem on every > individual shrinker invocation is going to cause noticable > scalability problems. Hence I don't think that this sort of "cycle > the global rwsem faster to reduce [un]register latency" solution is > going to fly because of the runtime performance regressions it will > introduce.... > I agree with your scalability concern (though others would argue to first demonstrate the issue before adding more sophisticated scalable code). Most memory reclaim code is written without the performance or scalability concern, maybe we should switch our thinking. > > I don't think this patch solves the problem, it only fixes one minor symptom of it. > > The actual problem here the reclaim hang in the nfs. > > The nfs client is waiting on the NFS server to respond. It may > actually be that the server has hung, not the client... > > > It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. > > *nod* > > > I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. > > That's what we are trying to do with similar blocking based issues > in XFS inode reclaim. It's not simple, though, because these days > memory reclaim is like a bowl full of spaghetti covered with a > delicious sauce of non-obvious heuristics and broken > functionality.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Dec 06, 2019 at 09:11:25AM -0800, Shakeel Butt wrote: > On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@fromorbit.com> wrote: > > If a shrinker is blocking for a long time, then we need to > > work to fix the shrinker implementation because blocking is a much > > bigger problem than just register/unregister. > > > > Yes, we should be fixing the implementations of all shrinkers and yes > it is bigger issue but we can also fix register/unregister isolation > issue in parallel. Fixing all shrinkers would a tedious and long task > and we should not block fixing isolation issue on it. "fixing all shrinkers" is a bit of hyperbole - you've identified only one instance where blocking is causing you problems. Indeed, most shrinkers are already non-blocking and won't cause you any problems at all. > > IOWs, we already know that cycling a global rwsem on every > > individual shrinker invocation is going to cause noticable > > scalability problems. Hence I don't think that this sort of "cycle > > the global rwsem faster to reduce [un]register latency" solution is > > going to fly because of the runtime performance regressions it will > > introduce.... > > > > I agree with your scalability concern (though others would argue to > first demonstrate the issue before adding more sophisticated scalable > code). Look at the git history. We *know* this is a problem, so anyone arguing that we have to prove it can go take a long walk of a short plank.... > Most memory reclaim code is written without the performance or > scalability concern, maybe we should switch our thinking. I think there's a lot of core mm and other developers that would disagree with you there. With respect to shrinkers, we've been directly concerned about performance and scalability of the individual instances as well as the infrastructure for at least the last decade.... Cheers, Dave.
On 12/4/19 11:35 AM, Michal Hocko wrote: > On Sat 30-11-19 00:45:41, Pavel Tikhomirov wrote: >> We have a problem that shrinker_rwsem can be held for a long time for >> read in shrink_slab, at the same time any process which is trying to >> manage shrinkers hangs. >> >> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. >> It tries to shrink something on nfs (hard) but nfs server is dead at >> these moment already and rpc will never succeed. Generally any shrinker >> can take significant time to do_shrink_slab, so it's a bad idea to hold >> the list lock here. > > Yes, this is a known problem and people have already tried to address it > in the past. Have you checked previous attempts? SRCU based one > http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain > but I believe there were others (I only had this one in my notes). > Please make sure to Cc Dave Chinner when posting a next version because > he had some concerns about the change of the behavior. The approach of the patch you are referencing is quiet different, it will still hold global srcu_read_lock(&srcu) when diving in do_shrink_slab and hanging nfs will still block all [un]register_shrinker. > >> We have a similar problem in shrink_slab_memcg, except that we are >> traversing shrinker_map+shrinker_idr there. >> >> The idea of the patch is to inc a refcount to the chosen shrinker so it >> won't disappear and release shrinker_rwsem while we are in >> do_shrink_slab, after that we will reacquire shrinker_rwsem, dec >> the refcount and continue the traversal. > > The reference count part makes sense to me. RCU role needs a better > explanation. I have 2 rcu's in patch, 1-st is to protect shrinker_map same as it was before in memcg_set_shrinker_bit, 2-nd is to protect shrinker struct in put_shrinker from being freed, as unregister_shrinker can see refcnt==0 without actually going to schedule(). > Also do you have any reason to not use completion for > the final step? Openconding essentially the same concept sounds a bit > awkward to me. Thanks for a good hint, from the first glance we can rework wait_event part to wait_for_completion. > >> We also need a wait_queue so that unregister_shrinker can wait for the >> refcnt to become zero. Only after these we can safely remove the >> shrinker from list and idr, and free the shrinker. > [...] >> crash> bt ... >> PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" >> #0 [...] __schedule at ... >> #1 [...] schedule at ... >> #2 [...] rpc_wait_bit_killable at ... [sunrpc] >> #3 [...] __wait_on_bit at ... >> #4 [...] out_of_line_wait_on_bit at ... >> #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] >> #6 [...] nfs4_proc_delegreturn at ... [nfsv4] >> #7 [...] nfs_do_return_delegation at ... [nfsv4] >> #8 [...] nfs4_evict_inode at ... [nfsv4] >> #9 [...] evict at ... >> #10 [...] dispose_list at ... >> #11 [...] prune_icache_sb at ... >> #12 [...] super_cache_scan at ... >> #13 [...] do_shrink_slab at ... > > Are NFS people aware of this? Because this is simply not acceptable > behavior. Memory reclaim cannot be block indefinitely or for a long > time. There must be a way to simply give up if the underlying inode > cannot be reclaimed. Sorry that I didn't cc nfs people from the begining. > > I still have to think about the proposed solution. It sounds a bit over > complicated to me. >
On 12/10/19 4:20 AM, Dave Chinner wrote: > On Fri, Dec 06, 2019 at 09:11:25AM -0800, Shakeel Butt wrote: >> On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@fromorbit.com> wrote: >>> If a shrinker is blocking for a long time, then we need to >>> work to fix the shrinker implementation because blocking is a much >>> bigger problem than just register/unregister. >>> >> >> Yes, we should be fixing the implementations of all shrinkers and yes >> it is bigger issue but we can also fix register/unregister isolation >> issue in parallel. Fixing all shrinkers would a tedious and long task >> and we should not block fixing isolation issue on it. > > "fixing all shrinkers" is a bit of hyperbole - you've identified > only one instance where blocking is causing you problems. Indeed, > most shrinkers are already non-blocking and won't cause you any > problems at all. > >>> IOWs, we already know that cycling a global rwsem on every >>> individual shrinker invocation is going to cause noticable >>> scalability problems. Hence I don't think that this sort of "cycle >>> the global rwsem faster to reduce [un]register latency" solution is >>> going to fly because of the runtime performance regressions it will >>> introduce.... >>> >> >> I agree with your scalability concern (though others would argue to >> first demonstrate the issue before adding more sophisticated scalable >> code). > > Look at the git history. We *know* this is a problem, so anyone > arguing that we have to prove it can go take a long walk of a short > plank.... > >> Most memory reclaim code is written without the performance or >> scalability concern, maybe we should switch our thinking. > > I think there's a lot of core mm and other developers that would > disagree with you there. With respect to shrinkers, we've been > directly concerned about performance and scalability of the > individual instances as well as the infrastructure for at least the > last decade.... > > Cheers, > > Dave. > Thanks a lot for your replies, now I see that the core of the problem is in nfs hang, before that I was unsure if it's OK to have such a hang in do_shrink_slab. I have a possibly bad idea on how my patch can still work. What if we use unlock/refcount way only for nfs-shrinkers? It will still give a performance penalty if one has many nfs mounts, but for those who has little number of nfs mounts the penalty would be less. And this would be a small isolation improvement for nfs users.
Hi all, I can still reproduce this problem on the latest linux version. The nfs long time requests will cause the other mount/unmount hang. I checked the thread and found Tikhomirov Pavel did not cc NFS people. So I reply this email to cc them. Is it a problem of NFS that holding shrinker_rwsem so long time? And is there some thought to fix it? On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: >We have a problem that shrinker_rwsem can be held for a long time for >read in shrink_slab, at the same time any process which is trying to >manage shrinkers hangs. > >The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. >It tries to shrink something on nfs (hard) but nfs server is dead at >these moment already and rpc will never succeed. Generally any shrinker >can take significant time to do_shrink_slab, so it's a bad idea to hold >the list lock here. > >We have a similar problem in shrink_slab_memcg, except that we are >traversing shrinker_map+shrinker_idr there. > >The idea of the patch is to inc a refcount to the chosen shrinker so it >won't disappear and release shrinker_rwsem while we are in >do_shrink_slab, after that we will reacquire shrinker_rwsem, dec >the refcount and continue the traversal. > >We also need a wait_queue so that unregister_shrinker can wait for the >refcnt to become zero. Only after these we can safely remove the >shrinker from list and idr, and free the shrinker. > >I've reproduced the nfs hang in do_shrink_slab with the patch applied on >ms kernel, all other mount/unmount pass fine without any hang. > >Here is a reproduction on kernel without patch: > >1) Setup nfs on server node with some files in it (e.g. 200) > >[server]# cat /etc/exports >/vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) > >2) Hard mount it on client node > >[client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt > >3) Open some (e.g. 200) files on the mount > >[client]# for i in $(find /mnt/ -type f | head -n 200); \ > do setsid sleep 1000 &>/dev/null <$i & done > >4) Kill all openers > >[client]# killall sleep -9 > >5) Put your network cable out on client node > >6) Drop caches on the client, it will hang on nfs while holding > shrinker_rwsem lock for read > >[client]# echo 3 > /proc/sys/vm/drop_caches > > crash> bt ... > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > #3 [...] __wait_on_bit at ... > #4 [...] out_of_line_wait_on_bit at ... > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > #7 [...] nfs_do_return_delegation at ... [nfsv4] > #8 [...] nfs4_evict_inode at ... [nfsv4] > #9 [...] evict at ... > #10 [...] dispose_list at ... > #11 [...] prune_icache_sb at ... > #12 [...] super_cache_scan at ... > #13 [...] do_shrink_slab at ... > #14 [...] shrink_slab at ... > #15 [...] drop_slab_node at ... > #16 [...] drop_slab at ... > #17 [...] drop_caches_sysctl_handler at ... > #18 [...] proc_sys_call_handler at ... > #19 [...] vfs_write at ... > #20 [...] ksys_write at ... > #21 [...] do_syscall_64 at ... > #22 [...] entry_SYSCALL_64_after_hwframe at ... > >7) All other mount/umount activity now hangs with no luck to take > shrinker_rwsem for write. > >[client]# mount -t tmpfs tmpfs /tmp > > crash> bt ... > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" > #0 [...] __schedule at ... > #1 [...] schedule at ... > #2 [...] rwsem_down_write_slowpath at ... > #3 [...] prealloc_shrinker at ... > #4 [...] alloc_super at ... > #5 [...] sget at ... > #6 [...] mount_nodev at ... > #7 [...] legacy_get_tree at ... > #8 [...] vfs_get_tree at ... > #9 [...] do_mount at ... > #10 [...] ksys_mount at ... > #11 [...] __x64_sys_mount at ... > #12 [...] do_syscall_64 at ... > #13 [...] entry_SYSCALL_64_after_hwframe at ... > >That is on almost clean and almost mainstream Fedora kernel: > >[client]# uname -a >Linux snorch 5.3.8-200.snorch.fc30.x86_64 #1 SMP Mon Nov 11 16:01:15 MSK > 2019 x86_64 x86_64 x86_64 GNU/Linux > >Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> >--- > include/linux/memcontrol.h | 6 +++ > include/linux/shrinker.h | 6 +++ > mm/memcontrol.c | 16 ++++++ > mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++- > 4 files changed, 134 insertions(+), 1 deletion(-) > >diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >index ae703ea3ef48..3717b94b6aa5 100644 >--- a/include/linux/memcontrol.h >+++ b/include/linux/memcontrol.h >@@ -1348,6 +1348,8 @@ extern int memcg_expand_shrinker_maps(int new_id); > > extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, > int nid, int shrinker_id); >+extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, >+ int nid, int shrinker_id); > #else > #define mem_cgroup_sockets_enabled 0 > static inline void mem_cgroup_sk_alloc(struct sock *sk) { }; >@@ -1361,6 +1363,10 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, > int nid, int shrinker_id) > { > } >+static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, >+ int nid, int shrinker_id) >+{ >+} > #endif > > struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); >diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h >index 0f80123650e2..dd3bb43ed58d 100644 >--- a/include/linux/shrinker.h >+++ b/include/linux/shrinker.h >@@ -2,6 +2,9 @@ > #ifndef _LINUX_SHRINKER_H > #define _LINUX_SHRINKER_H > >+#include <linux/refcount.h> >+#include <linux/wait.h> >+ > /* > * This struct is used to pass information from page reclaim to the shrinkers. > * We consolidate the values for easier extention later. >@@ -75,6 +78,9 @@ struct shrinker { > #endif > /* objs pending delete, per node */ > atomic_long_t *nr_deferred; >+ >+ refcount_t refcnt; >+ wait_queue_head_t wq; > }; > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */ > >diff --git a/mm/memcontrol.c b/mm/memcontrol.c >index 01f3f8b665e9..81f45124feb7 100644 >--- a/mm/memcontrol.c >+++ b/mm/memcontrol.c >@@ -442,6 +442,22 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) > } > } > >+void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, >+ int nid, int shrinker_id) >+{ >+ struct memcg_shrinker_map *map; >+ >+ /* >+ * The map for refcounted memcg can only be freed in >+ * memcg_free_shrinker_map_rcu so we can safely protect >+ * map with rcu_read_lock. >+ */ >+ rcu_read_lock(); >+ map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map); >+ clear_bit(shrinker_id, map->map); >+ rcu_read_unlock(); >+} >+ > /** > * mem_cgroup_css_from_page - css of the memcg associated with a page > * @page: page of interest >diff --git a/mm/vmscan.c b/mm/vmscan.c >index ee4eecc7e1c2..59e46d65e902 100644 >--- a/mm/vmscan.c >+++ b/mm/vmscan.c >@@ -393,6 +393,13 @@ int prealloc_shrinker(struct shrinker *shrinker) > if (!shrinker->nr_deferred) > return -ENOMEM; > >+ /* >+ * Shrinker is not yet visible through shrinker_idr or shrinker_list, >+ * so no locks required for initialization. >+ */ >+ refcount_set(&shrinker->refcnt, 1); >+ init_waitqueue_head(&shrinker->wq); >+ > if (shrinker->flags & SHRINKER_MEMCG_AWARE) { > if (prealloc_memcg_shrinker(shrinker)) > goto free_deferred; >@@ -411,6 +418,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker) > if (!shrinker->nr_deferred) > return; > >+ /* The shrinker shouldn't be used at these point. */ >+ WARN_ON(!refcount_dec_and_test(&shrinker->refcnt)); >+ > if (shrinker->flags & SHRINKER_MEMCG_AWARE) > unregister_memcg_shrinker(shrinker); > >@@ -447,6 +457,15 @@ void unregister_shrinker(struct shrinker *shrinker) > { > if (!shrinker->nr_deferred) > return; >+ >+ /* >+ * If refcnt is not zero we need to wait these shrinker to finish all >+ * it's do_shrink_slab() calls. >+ */ >+ if (!refcount_dec_and_test(&shrinker->refcnt)) >+ wait_event(shrinker->wq, >+ (refcount_read(&shrinker->refcnt) == 0)); >+ > if (shrinker->flags & SHRINKER_MEMCG_AWARE) > unregister_memcg_shrinker(shrinker); > down_write(&shrinker_rwsem); >@@ -454,6 +473,9 @@ void unregister_shrinker(struct shrinker *shrinker) > up_write(&shrinker_rwsem); > kfree(shrinker->nr_deferred); > shrinker->nr_deferred = NULL; >+ >+ /* Pairs with rcu_read_lock in put_shrinker() */ >+ synchronize_rcu(); > } > EXPORT_SYMBOL(unregister_shrinker); > >@@ -589,6 +611,42 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > return freed; > } > >+struct shrinker *get_shrinker(struct shrinker *shrinker) >+{ >+ /* >+ * Pairs with refcnt dec in unregister_shrinker(), if refcnt is zero >+ * these shrinker is already in the middle of unregister_shrinker() and >+ * we can't use it. >+ */ >+ if (!refcount_inc_not_zero(&shrinker->refcnt)) >+ shrinker = NULL; >+ return shrinker; >+} >+ >+void put_shrinker(struct shrinker *shrinker) >+{ >+ /* >+ * The rcu_read_lock pairs with synchronize_rcu() in >+ * unregister_shrinker(), so that the shrinker is not freed >+ * before the wake_up. >+ */ >+ rcu_read_lock(); >+ if (!refcount_dec_and_test(&shrinker->refcnt)) { >+ /* >+ * Pairs with smp_mb in >+ * wait_event()->prepare_to_wait() >+ */ >+ smp_mb(); >+ /* >+ * If refcnt becomes zero, we already have an >+ * unregister_shrinker() waiting for us to finish. >+ */ >+ if (waitqueue_active(&shrinker->wq)) >+ wake_up(&shrinker->wq); >+ } >+ rcu_read_unlock(); >+} >+ > #ifdef CONFIG_MEMCG > static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > struct mem_cgroup *memcg, int priority) >@@ -628,9 +686,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > !(shrinker->flags & SHRINKER_NONSLAB)) > continue; > >+ /* >+ * Take a refcnt on a shrinker so that it can't be freed or >+ * removed from shrinker_idr (and shrinker_list). These way we >+ * can safely release shrinker_rwsem. >+ * >+ * We need to release shrinker_rwsem here as do_shrink_slab can >+ * take too much time to finish (e.g. on nfs). And holding >+ * global shrinker_rwsem can block registring and unregistring >+ * of shrinkers. >+ */ >+ if (!get_shrinker(shrinker)) >+ continue; >+ up_read(&shrinker_rwsem); >+ > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) { >- clear_bit(i, map->map); >+ memcg_clear_shrinker_bit(memcg, nid, i); > /* > * After the shrinker reported that it had no objects to > * free, but before we cleared the corresponding bit in >@@ -655,6 +727,22 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, > } > freed += ret; > >+ /* >+ * Re-aquire shrinker_rwsem, put refcount and reload >+ * shrinker_map as it can be modified in >+ * memcg_expand_one_shrinker_map if new shrinkers >+ * were registred in the meanwhile. >+ */ >+ if (!down_read_trylock(&shrinker_rwsem)) { >+ freed = freed ? : 1; >+ put_shrinker(shrinker); >+ return freed; >+ } >+ put_shrinker(shrinker); >+ map = rcu_dereference_protected( >+ memcg->nodeinfo[nid]->shrinker_map, >+ true); >+ > if (rwsem_is_contended(&shrinker_rwsem)) { > freed = freed ? : 1; > break; >@@ -719,10 +807,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, > .memcg = memcg, > }; > >+ /* See comment in shrink_slab_memcg. */ >+ if (!get_shrinker(shrinker)) >+ continue; >+ up_read(&shrinker_rwsem); >+ > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; >+ >+ /* >+ * We can safely continue traverse of the shrinker_list as >+ * our shrinker is still on the list due to refcount. >+ */ >+ if (!down_read_trylock(&shrinker_rwsem)) { >+ freed = freed ? : 1; >+ put_shrinker(shrinker); >+ goto out; >+ } >+ put_shrinker(shrinker); >+ > /* > * Bail out if someone want to register a new shrinker to > * prevent the regsitration from being stalled for long periods >-- >2.21.0
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ae703ea3ef48..3717b94b6aa5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1348,6 +1348,8 @@ extern int memcg_expand_shrinker_maps(int new_id); extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id); +extern void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, + int nid, int shrinker_id); #else #define mem_cgroup_sockets_enabled 0 static inline void mem_cgroup_sk_alloc(struct sock *sk) { }; @@ -1361,6 +1363,10 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) { } +static inline void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, + int nid, int shrinker_id) +{ +} #endif struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 0f80123650e2..dd3bb43ed58d 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -2,6 +2,9 @@ #ifndef _LINUX_SHRINKER_H #define _LINUX_SHRINKER_H +#include <linux/refcount.h> +#include <linux/wait.h> + /* * This struct is used to pass information from page reclaim to the shrinkers. * We consolidate the values for easier extention later. @@ -75,6 +78,9 @@ struct shrinker { #endif /* objs pending delete, per node */ atomic_long_t *nr_deferred; + + refcount_t refcnt; + wait_queue_head_t wq; }; #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 01f3f8b665e9..81f45124feb7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -442,6 +442,22 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) } } +void memcg_clear_shrinker_bit(struct mem_cgroup *memcg, + int nid, int shrinker_id) +{ + struct memcg_shrinker_map *map; + + /* + * The map for refcounted memcg can only be freed in + * memcg_free_shrinker_map_rcu so we can safely protect + * map with rcu_read_lock. + */ + rcu_read_lock(); + map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map); + clear_bit(shrinker_id, map->map); + rcu_read_unlock(); +} + /** * mem_cgroup_css_from_page - css of the memcg associated with a page * @page: page of interest diff --git a/mm/vmscan.c b/mm/vmscan.c index ee4eecc7e1c2..59e46d65e902 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -393,6 +393,13 @@ int prealloc_shrinker(struct shrinker *shrinker) if (!shrinker->nr_deferred) return -ENOMEM; + /* + * Shrinker is not yet visible through shrinker_idr or shrinker_list, + * so no locks required for initialization. + */ + refcount_set(&shrinker->refcnt, 1); + init_waitqueue_head(&shrinker->wq); + if (shrinker->flags & SHRINKER_MEMCG_AWARE) { if (prealloc_memcg_shrinker(shrinker)) goto free_deferred; @@ -411,6 +418,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker) if (!shrinker->nr_deferred) return; + /* The shrinker shouldn't be used at these point. */ + WARN_ON(!refcount_dec_and_test(&shrinker->refcnt)); + if (shrinker->flags & SHRINKER_MEMCG_AWARE) unregister_memcg_shrinker(shrinker); @@ -447,6 +457,15 @@ void unregister_shrinker(struct shrinker *shrinker) { if (!shrinker->nr_deferred) return; + + /* + * If refcnt is not zero we need to wait these shrinker to finish all + * it's do_shrink_slab() calls. + */ + if (!refcount_dec_and_test(&shrinker->refcnt)) + wait_event(shrinker->wq, + (refcount_read(&shrinker->refcnt) == 0)); + if (shrinker->flags & SHRINKER_MEMCG_AWARE) unregister_memcg_shrinker(shrinker); down_write(&shrinker_rwsem); @@ -454,6 +473,9 @@ void unregister_shrinker(struct shrinker *shrinker) up_write(&shrinker_rwsem); kfree(shrinker->nr_deferred); shrinker->nr_deferred = NULL; + + /* Pairs with rcu_read_lock in put_shrinker() */ + synchronize_rcu(); } EXPORT_SYMBOL(unregister_shrinker); @@ -589,6 +611,42 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, return freed; } +struct shrinker *get_shrinker(struct shrinker *shrinker) +{ + /* + * Pairs with refcnt dec in unregister_shrinker(), if refcnt is zero + * these shrinker is already in the middle of unregister_shrinker() and + * we can't use it. + */ + if (!refcount_inc_not_zero(&shrinker->refcnt)) + shrinker = NULL; + return shrinker; +} + +void put_shrinker(struct shrinker *shrinker) +{ + /* + * The rcu_read_lock pairs with synchronize_rcu() in + * unregister_shrinker(), so that the shrinker is not freed + * before the wake_up. + */ + rcu_read_lock(); + if (!refcount_dec_and_test(&shrinker->refcnt)) { + /* + * Pairs with smp_mb in + * wait_event()->prepare_to_wait() + */ + smp_mb(); + /* + * If refcnt becomes zero, we already have an + * unregister_shrinker() waiting for us to finish. + */ + if (waitqueue_active(&shrinker->wq)) + wake_up(&shrinker->wq); + } + rcu_read_unlock(); +} + #ifdef CONFIG_MEMCG static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) @@ -628,9 +686,23 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, !(shrinker->flags & SHRINKER_NONSLAB)) continue; + /* + * Take a refcnt on a shrinker so that it can't be freed or + * removed from shrinker_idr (and shrinker_list). These way we + * can safely release shrinker_rwsem. + * + * We need to release shrinker_rwsem here as do_shrink_slab can + * take too much time to finish (e.g. on nfs). And holding + * global shrinker_rwsem can block registring and unregistring + * of shrinkers. + */ + if (!get_shrinker(shrinker)) + continue; + up_read(&shrinker_rwsem); + ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) { - clear_bit(i, map->map); + memcg_clear_shrinker_bit(memcg, nid, i); /* * After the shrinker reported that it had no objects to * free, but before we cleared the corresponding bit in @@ -655,6 +727,22 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, } freed += ret; + /* + * Re-aquire shrinker_rwsem, put refcount and reload + * shrinker_map as it can be modified in + * memcg_expand_one_shrinker_map if new shrinkers + * were registred in the meanwhile. + */ + if (!down_read_trylock(&shrinker_rwsem)) { + freed = freed ? : 1; + put_shrinker(shrinker); + return freed; + } + put_shrinker(shrinker); + map = rcu_dereference_protected( + memcg->nodeinfo[nid]->shrinker_map, + true); + if (rwsem_is_contended(&shrinker_rwsem)) { freed = freed ? : 1; break; @@ -719,10 +807,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, .memcg = memcg, }; + /* See comment in shrink_slab_memcg. */ + if (!get_shrinker(shrinker)) + continue; + up_read(&shrinker_rwsem); + ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) ret = 0; freed += ret; + + /* + * We can safely continue traverse of the shrinker_list as + * our shrinker is still on the list due to refcount. + */ + if (!down_read_trylock(&shrinker_rwsem)) { + freed = freed ? : 1; + put_shrinker(shrinker); + goto out; + } + put_shrinker(shrinker); + /* * Bail out if someone want to register a new shrinker to * prevent the regsitration from being stalled for long periods
We have a problem that shrinker_rwsem can be held for a long time for read in shrink_slab, at the same time any process which is trying to manage shrinkers hangs. The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. It tries to shrink something on nfs (hard) but nfs server is dead at these moment already and rpc will never succeed. Generally any shrinker can take significant time to do_shrink_slab, so it's a bad idea to hold the list lock here. We have a similar problem in shrink_slab_memcg, except that we are traversing shrinker_map+shrinker_idr there. The idea of the patch is to inc a refcount to the chosen shrinker so it won't disappear and release shrinker_rwsem while we are in do_shrink_slab, after that we will reacquire shrinker_rwsem, dec the refcount and continue the traversal. We also need a wait_queue so that unregister_shrinker can wait for the refcnt to become zero. Only after these we can safely remove the shrinker from list and idr, and free the shrinker. I've reproduced the nfs hang in do_shrink_slab with the patch applied on ms kernel, all other mount/unmount pass fine without any hang. Here is a reproduction on kernel without patch: 1) Setup nfs on server node with some files in it (e.g. 200) [server]# cat /etc/exports /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) 2) Hard mount it on client node [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt 3) Open some (e.g. 200) files on the mount [client]# for i in $(find /mnt/ -type f | head -n 200); \ do setsid sleep 1000 &>/dev/null <$i & done 4) Kill all openers [client]# killall sleep -9 5) Put your network cable out on client node 6) Drop caches on the client, it will hang on nfs while holding shrinker_rwsem lock for read [client]# echo 3 > /proc/sys/vm/drop_caches crash> bt ... PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" #0 [...] __schedule at ... #1 [...] schedule at ... #2 [...] rpc_wait_bit_killable at ... [sunrpc] #3 [...] __wait_on_bit at ... #4 [...] out_of_line_wait_on_bit at ... #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] #6 [...] nfs4_proc_delegreturn at ... [nfsv4] #7 [...] nfs_do_return_delegation at ... [nfsv4] #8 [...] nfs4_evict_inode at ... [nfsv4] #9 [...] evict at ... #10 [...] dispose_list at ... #11 [...] prune_icache_sb at ... #12 [...] super_cache_scan at ... #13 [...] do_shrink_slab at ... #14 [...] shrink_slab at ... #15 [...] drop_slab_node at ... #16 [...] drop_slab at ... #17 [...] drop_caches_sysctl_handler at ... #18 [...] proc_sys_call_handler at ... #19 [...] vfs_write at ... #20 [...] ksys_write at ... #21 [...] do_syscall_64 at ... #22 [...] entry_SYSCALL_64_after_hwframe at ... 7) All other mount/umount activity now hangs with no luck to take shrinker_rwsem for write. [client]# mount -t tmpfs tmpfs /tmp crash> bt ... PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" #0 [...] __schedule at ... #1 [...] schedule at ... #2 [...] rwsem_down_write_slowpath at ... #3 [...] prealloc_shrinker at ... #4 [...] alloc_super at ... #5 [...] sget at ... #6 [...] mount_nodev at ... #7 [...] legacy_get_tree at ... #8 [...] vfs_get_tree at ... #9 [...] do_mount at ... #10 [...] ksys_mount at ... #11 [...] __x64_sys_mount at ... #12 [...] do_syscall_64 at ... #13 [...] entry_SYSCALL_64_after_hwframe at ... That is on almost clean and almost mainstream Fedora kernel: [client]# uname -a Linux snorch 5.3.8-200.snorch.fc30.x86_64 #1 SMP Mon Nov 11 16:01:15 MSK 2019 x86_64 x86_64 x86_64 GNU/Linux Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> --- include/linux/memcontrol.h | 6 +++ include/linux/shrinker.h | 6 +++ mm/memcontrol.c | 16 ++++++ mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++- 4 files changed, 134 insertions(+), 1 deletion(-)