Message ID: 20201210080844.23741-1-sjpark@amazon.com (mailing list archive)
Series: net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
> calls make the number of active slab objects, including the
> 'sock_inode_cache' type, increase rapidly and continuously. As a
> result, memory pressure occurs.
>
> In more detail, I made an artificial reproducer that resembles the
> workload under which we found the problem, but reproduces it faster.
> It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop and
> takes about 2 minutes. On a machine with 40 CPU cores and 70GB of
> DRAM, it reduced the available memory by about 15GB in total. Note
> that the issue doesn't reproduce on every machine. On my machine with
> 6 CPU cores, the problem didn't reproduce.

OK, that is the number before the patch, but what is the number after
the patch?

I think the idea is very nice, but this will serialize fqdir hash
table destruction on one single cpu, and this might become a real
issue _if_ these hash tables are populated.

(Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET); all
these tables are empty...)

As you may know, frags are often used as vectors for DDOS attacks.

I would suggest not (ab)using system_wq, but using a dedicated work
queue with a limit (@max_active argument set to 1 in
alloc_workqueue()), to make sure that the number of threads is
optimal/bounded.

Only the phase after hash table removal could benefit from your
deferral to a single context, so that a single rcu_barrier() is
active, since the part after rcu_barrier() is damn cheap and _can_ be
serialized:

	if (refcount_dec_and_test(&f->refcnt))
		complete(&f->completion);

Thanks !

> 'cleanup_net()' and 'fqdir_work_fn()' are the functions that
> deallocate the relevant memory objects. They are asynchronously
> invoked by work queues and internally use 'rcu_barrier()' to ensure
> safe destruction. 'cleanup_net()' works in a batched manner in a
> single-thread worker, while 'fqdir_work_fn()' works for each
> 'fqdir_exit()' call in the 'system_wq'.
>
> Therefore, 'fqdir_work_fn()' was called frequently under the workload
> and made the contention on 'rcu_barrier()' high. In more detail, the
> global mutex 'rcu_state.barrier_mutex' became the bottleneck.
>
> I tried making 'fqdir_work_fn()' batched and confirmed it works. The
> following patch is for the change. I think this is the right solution
> as a point fix for this issue, but one might blame different parts:
>
> 1. User: frequent 'unshare()' calls
> From some point of view, such frequent 'unshare()' calls might simply
> seem insane.
>
> 2. Global mutex in 'rcu_barrier()'
> Because of the global mutex, 'rcu_barrier()' callers could wait long,
> even after their callbacks have started, before the call finishes.
> Therefore, similar issues could happen in other 'rcu_barrier()'
> usages. Maybe we could use some wait-queue-like mechanism to notify
> the waiters when the desired time comes.
>
> I personally believe applying the point fix for now and improving
> 'rcu_barrier()' in the long term makes sense. If I'm missing
> something or you have a different opinion, please feel free to let me
> know.
> Patch History
> -------------
>
> Changes from v1
> (https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
> - Keep xmas tree variable ordering (Jakub Kicinski)
> - Add more numbers (Eric Dumazet)
> - Use 'llist_for_each_entry_safe()' (Eric Dumazet)
>
> SeongJae Park (1):
>   net/ipv4/inet_fragment: Batch fqdir destroy works
>
>  include/net/inet_frag.h  |  2 +-
>  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
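[Editor's note: for concreteness, the 50,000-iteration loop Eric
references above can be written as a tiny standalone program. This is a
sketch reconstructed from the description in the cover letter, not the
reporter's actual reproducer; 'unshare(CLONE_NEWNET)' requires
CAP_SYS_ADMIN, so it has to run as root.]

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            /* Each iteration moves the process into a fresh network
             * namespace; the previous namespace loses its last user and
             * is torn down asynchronously via cleanup_net()/fqdir_exit(). */
            for (int i = 0; i < 50000; i++) {
                    if (unshare(CLONE_NEWNET)) {
                            perror("unshare");
                            return 1;
                    }
            }
            return 0;
    }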
On Thu, 10 Dec 2020 15:09:10 +0100 Eric Dumazet <edumazet@google.com> wrote:

> On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
> > calls make the number of active slab objects, including the
> > 'sock_inode_cache' type, increase rapidly and continuously. As a
> > result, memory pressure occurs.
> >
> > In more detail, I made an artificial reproducer that resembles the
> > workload under which we found the problem, but reproduces it faster.
> > It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop and
> > takes about 2 minutes. On a machine with 40 CPU cores and 70GB of
> > DRAM, it reduced the available memory by about 15GB in total. Note
> > that the issue doesn't reproduce on every machine. On my machine with
> > 6 CPU cores, the problem didn't reproduce.
>
> OK, that is the number before the patch, but what is the number after
> the patch?

No continuous memory reduction, only some fluctuation, was observed.
Nevertheless, the available memory reduction was only up to about
400MB.

> I think the idea is very nice, but this will serialize fqdir hash
> table destruction on one single cpu, and this might become a real
> issue _if_ these hash tables are populated.
>
> (Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET); all
> these tables are empty...)
>
> As you may know, frags are often used as vectors for DDOS attacks.
>
> I would suggest not (ab)using system_wq, but using a dedicated work
> queue with a limit (@max_active argument set to 1 in
> alloc_workqueue()), to make sure that the number of threads is
> optimal/bounded.
>
> Only the phase after hash table removal could benefit from your
> deferral to a single context, so that a single rcu_barrier() is
> active, since the part after rcu_barrier() is damn cheap and _can_ be
> serialized:
>
>	if (refcount_dec_and_test(&f->refcnt))
>		complete(&f->completion);

Good point, thanks for this kind suggestion. I will do so in the next
version.


Thanks,
SeongJae Park
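[Editor's note: a minimal sketch of the dedicated work queue Eric
suggests, for illustration only; the name 'fqdir_wq' and the init
function are assumptions, not code from the posted series. The key
point is the @max_active argument of 1, which bounds the queue to one
in-flight work item and hence at most one concurrent rcu_barrier().]

    /* Illustrative replacement for system_wq (name is hypothetical). */
    static struct workqueue_struct *fqdir_wq;

    static int __init fqdir_wq_init(void)
    {
            /* max_active == 1: at most one destroy work executes at a
             * time, keeping the number of worker threads bounded. */
            fqdir_wq = alloc_workqueue("fqdir", 0, 1);
            if (!fqdir_wq)
                    return -ENOMEM;
            return 0;
    }

Destroy work would then be queued with 'queue_work(fqdir_wq, ...)'
instead of on 'system_wq'.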
From: SeongJae Park <sjpark@amazon.de>

On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including the
'sock_inode_cache' type, increase rapidly and continuously. As a result,
memory pressure occurs.

In more detail, I made an artificial reproducer that resembles the
workload under which we found the problem, but reproduces it faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop and takes
about 2 minutes. On a machine with 40 CPU cores and 70GB of DRAM, it
reduced the available memory by about 15GB in total. Note that the issue
doesn't reproduce on every machine. On my machine with 6 CPU cores, the
problem didn't reproduce.

'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate
the relevant memory objects. They are asynchronously invoked by work
queues and internally use 'rcu_barrier()' to ensure safe destruction.
'cleanup_net()' works in a batched manner in a single-thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'.

Therefore, 'fqdir_work_fn()' was called frequently under the workload
and made the contention on 'rcu_barrier()' high. In more detail, the
global mutex 'rcu_state.barrier_mutex' became the bottleneck.

I tried making 'fqdir_work_fn()' batched and confirmed it works. The
following patch is for the change. I think this is the right solution as
a point fix for this issue, but one might blame different parts:

1. User: frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might simply
seem insane.

2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers could wait long,
even after their callbacks have started, before the call finishes.
Therefore, similar issues could happen in other 'rcu_barrier()' usages.
Maybe we could use some wait-queue-like mechanism to notify the waiters
when the desired time comes.

I personally believe applying the point fix for now and improving
'rcu_barrier()' in the long term makes sense. If I'm missing something
or you have a different opinion, please feel free to let me know.


Patch History
-------------

Changes from v1
(https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)

SeongJae Park (1):
  net/ipv4/inet_fragment: Batch fqdir destroy works

 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)
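[Editor's note: to make the batching idea in the cover letter concrete,
here is a sketch of how a batched 'fqdir_work_fn()' could combine a
lock-free 'llist' with a single 'rcu_barrier()' per batch, in the spirit
of the changelog's mention of 'llist_for_each_entry_safe()'. The names
'fqdir_free_list', 'fqdir_free_fn', and a 'free_list' llist_node member
in struct fqdir are illustrative assumptions; see the actual patch for
the real code.]

    static LLIST_HEAD(fqdir_free_list);

    static void fqdir_free_fn(struct work_struct *work)
    {
            struct llist_node *kill_list;
            struct fqdir *fqdir, *tmp;
            struct inet_frags *f;

            /* Atomically take ownership of everything queued so far. */
            kill_list = llist_del_all(&fqdir_free_list);

            /* A single rcu_barrier() covers the whole batch: wait for
             * pending call_rcu() callbacks that may still dereference
             * the fqdirs about to be freed. */
            rcu_barrier();

            llist_for_each_entry_safe(fqdir, tmp, kill_list, free_list) {
                    f = fqdir->f;
                    if (refcount_dec_and_test(&f->refcnt))
                            complete(&f->completion);
                    kfree(fqdir);
            }
    }

    static DECLARE_WORK(fqdir_free_work, fqdir_free_fn);

    /* Called from fqdir teardown instead of queueing one work item per
     * fqdir: only the addition that makes the list non-empty schedules
     * the (batched) work. */
    static void fqdir_schedule_free(struct fqdir *fqdir)
    {
            if (llist_add(&fqdir->free_list, &fqdir_free_list))
                    queue_work(system_wq, &fqdir_free_work);
    }

This sketch keeps 'system_wq' as in the v2 cover letter; per the review
above, a dedicated bounded work queue would replace it in a later
version.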