[v2,0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'

Message ID: 20201210080844.23741-1-sjpark@amazon.com

Message

SeongJae Park Dec. 10, 2020, 8:08 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including those of the
'sock_inode_cache' type, increase rapidly and continuously.  As a result,
memory pressure occurs.

In more detail, I made an artificial reproducer that resembles the
workload in which we found the problem, and reproduces the problem faster.
It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop, which
takes about 2 minutes.  On a machine with 40 CPU cores and 70GB of DRAM,
it reduced the available memory by about 15GB in total.  Note that the
issue doesn't reproduce on every machine.  On my 6-core machine, the
problem didn't reproduce.
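For reference, a minimal sketch of such a reproducer (it must run with
CAP_SYS_ADMIN, since 'unshare(CLONE_NEWNET)' requires that capability):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          int i;

          /* Each call moves us to a fresh netns; the previous one
           * becomes garbage for cleanup_net() / fqdir_exit(). */
          for (i = 0; i < 50000; i++) {
                  if (unshare(CLONE_NEWNET)) {
                          perror("unshare");
                          return 1;
                  }
          }
          return 0;
  }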

'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate the
relevant memory objects.  They are asynchronously invoked from work
queues and internally use 'rcu_barrier()' to ensure safe destruction.
'cleanup_net()' works in a batched manner in a single-threaded worker,
while 'fqdir_work_fn()' is queued on 'system_wq' once per 'fqdir_exit()'
call.

Therefore, 'fqdir_work_fn()' was called frequently under the workload and
made the contention on 'rcu_barrier()' high.  In more detail, the global
mutex 'rcu_state.barrier_mutex' became the bottleneck.
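For reference, a simplified sketch of the pre-patch per-exit path,
abridged from 'net/ipv4/inet_fragment.c' (details may differ between
kernel versions):

  static void fqdir_work_fn(struct work_struct *work)
  {
          struct fqdir *fqdir = container_of(work, struct fqdir,
                                             destroy_work);
          struct inet_frags *f = fqdir->f;

          rhashtable_free_and_destroy(&fqdir->rhashtable,
                                      inet_frags_free_cb, NULL);

          /* Wait for every in-flight call_rcu() that may still
           * dereference this fqdir -- one barrier per exiting netns. */
          rcu_barrier();

          if (refcount_dec_and_test(&f->refcnt))
                  complete(&f->completion);

          kfree(fqdir);
  }

  void fqdir_exit(struct fqdir *fqdir)
  {
          INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
          queue_work(system_wq, &fqdir->destroy_work);
  }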

I tried making 'fqdir_work_fn()' batched and confirmed it works.  The
following patch implements the change.  I think this is the right point
fix for this issue, but someone might instead blame other parts:

1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might simply seem
insane.

2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers could wait for a long
time even after their own callbacks have completed.  Therefore, similar
issues could happen with other 'rcu_barrier()' usages.  Maybe we could use
some wait-queue-like mechanism to notify the waiters when the desired time
comes.

I personally believe that applying the point fix for now and improving
'rcu_barrier()' in the long term makes sense.  If I'm missing something or
you have a different opinion, please feel free to let me know.
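
To illustrate the direction of the batching, below is a rough sketch; it
assumes a 'struct llist_node free_list' member added to 'struct fqdir',
and the names are illustrative (the actual code is in the patch; the
rhashtable teardown is omitted for brevity):

  static LLIST_HEAD(fqdir_free_list);

  static void fqdir_free_fn(struct work_struct *work)
  {
          struct llist_node *kill_list;
          struct fqdir *fqdir, *tmp;
          struct inet_frags *f;

          /* Atomically take every victim queued so far */
          kill_list = llist_del_all(&fqdir_free_list);

          /* A single barrier now serves the whole batch */
          rcu_barrier();

          llist_for_each_entry_safe(fqdir, tmp, kill_list, free_list) {
                  f = fqdir->f;
                  if (refcount_dec_and_test(&f->refcnt))
                          complete(&f->completion);
                  kfree(fqdir);
          }
  }

  static DECLARE_WORK(fqdir_free_work, fqdir_free_fn);

  void fqdir_exit(struct fqdir *fqdir)
  {
          /* Lock-free push; kick the worker only for the first entry */
          if (llist_add(&fqdir->free_list, &fqdir_free_list))
                  queue_work(system_wq, &fqdir_free_work);
  }

'llist' allows lock-free producers in 'fqdir_exit()', and
'llist_del_all()' lets the single worker take the whole pending batch at
once, so one 'rcu_barrier()' covers all of them.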


Patch History
-------------

Changes from v1
(https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)

SeongJae Park (1):
  net/ipv4/inet_fragment: Batch fqdir destroy works

 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

Comments

Eric Dumazet Dec. 10, 2020, 2:09 p.m. UTC | #1
On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
> calls make the number of active slab objects, including those of the
> 'sock_inode_cache' type, increase rapidly and continuously.  As a result,
> memory pressure occurs.
>
> In more detail, I made an artificial reproducer that resembles the
> workload in which we found the problem, and reproduces the problem faster.
> It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop, which
> takes about 2 minutes.  On a machine with 40 CPU cores and 70GB of DRAM,
> it reduced the available memory by about 15GB in total.  Note that the
> issue doesn't reproduce on every machine.  On my 6-core machine, the
> problem didn't reproduce.

OK, that is the number before the patch, but what is the number after
the patch?

I think the idea is very nice, but this will serialize fqdir hash table
destruction on one single CPU, which might become a real issue _if_ these
hash tables are populated.

(Obviously, in your 'for (i = 1; i < 50000; i++) unshare(CLONE_NEWNET);'
loop, all these tables are empty...)

As you may know, frags are often used as vectors for DDoS attacks.

I would suggest not (ab)using system_wq, but instead using a dedicated
work queue with a limit (the @max_active argument set to 1 in
alloc_workqueue()), to make sure that the number of threads is
optimal/bounded.
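
A minimal sketch of what I mean (the queue name and init function are
of course up to you):

  static struct workqueue_struct *fqdir_wq;

  static int __init fqdir_wq_init(void)
  {
          /* @max_active == 1: at most one destroy work in flight,
           * bounding the number of worker threads. */
          fqdir_wq = alloc_workqueue("fqdir", 0, 1);
          return fqdir_wq ? 0 : -ENOMEM;
  }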

Only the phase after hash table removal could benefit from your deferral
to a single context, so that only a single rcu_barrier() is active at a
time, since the part after rcu_barrier() is damn cheap and _can_ be
serialized:

  if (refcount_dec_and_test(&f->refcnt))
          complete(&f->completion);

Thanks!

>
> 'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate the
> relevant memory objects.  They are asynchronously invoked from work
> queues and internally use 'rcu_barrier()' to ensure safe destruction.
> 'cleanup_net()' works in a batched manner in a single-threaded worker,
> while 'fqdir_work_fn()' is queued on 'system_wq' once per 'fqdir_exit()'
> call.
>
> Therefore, 'fqdir_work_fn()' was called frequently under the workload and
> made the contention on 'rcu_barrier()' high.  In more detail, the global
> mutex 'rcu_state.barrier_mutex' became the bottleneck.
>
> I tried making 'fqdir_work_fn()' batched and confirmed it works.  The
> following patch implements the change.  I think this is the right point
> fix for this issue, but someone might instead blame other parts:
>
> 1. User: Frequent 'unshare()' calls
> From some point of view, such frequent 'unshare()' calls might simply seem
> insane.
>
> 2. Global mutex in 'rcu_barrier()'
> Because of the global mutex, 'rcu_barrier()' callers could wait for a long
> time even after their own callbacks have completed.  Therefore, similar
> issues could happen with other 'rcu_barrier()' usages.  Maybe we could use
> some wait-queue-like mechanism to notify the waiters when the desired time
> comes.
>
> I personally believe that applying the point fix for now and improving
> 'rcu_barrier()' in the long term makes sense.  If I'm missing something or
> you have a different opinion, please feel free to let me know.
>
>
>
> Patch History
> -------------
>
> Changes from v1
> (https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
> - Keep xmas tree variable ordering (Jakub Kicinski)
> - Add more numbers (Eric Dumazet)
> - Use 'llist_for_each_entry_safe()' (Eric Dumazet)
>
> SeongJae Park (1):
>   net/ipv4/inet_fragment: Batch fqdir destroy works
>
>  include/net/inet_frag.h  |  2 +-
>  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>
SeongJae Park Dec. 10, 2020, 10:16 p.m. UTC | #2
On Thu, 10 Dec 2020 15:09:10 +0100 Eric Dumazet <edumazet@google.com> wrote:

> On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
> > calls make the number of active slab objects, including those of the
> > 'sock_inode_cache' type, increase rapidly and continuously.  As a result,
> > memory pressure occurs.
> >
> > In more detail, I made an artificial reproducer that resembles the
> > workload in which we found the problem, and reproduces the problem faster.
> > It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop, which
> > takes about 2 minutes.  On a machine with 40 CPU cores and 70GB of DRAM,
> > it reduced the available memory by about 15GB in total.  Note that the
> > issue doesn't reproduce on every machine.  On my 6-core machine, the
> > problem didn't reproduce.
> 
> OK, that is the number before the patch, but what is the number after
> the patch?

No continuous memory reduction, but some fluctuation was observed.
Nevertheless, the available memory reduction was only up to about 400MB.

> 
> I think the idea is very nice, but this will serialize fqdir hash table
> destruction on one single CPU, which might become a real issue _if_ these
> hash tables are populated.
>
> (Obviously, in your 'for (i = 1; i < 50000; i++) unshare(CLONE_NEWNET);'
> loop, all these tables are empty...)
>
> As you may know, frags are often used as vectors for DDoS attacks.
>
> I would suggest not (ab)using system_wq, but instead using a dedicated
> work queue with a limit (the @max_active argument set to 1 in
> alloc_workqueue()), to make sure that the number of threads is
> optimal/bounded.
>
> Only the phase after hash table removal could benefit from your deferral
> to a single context, so that only a single rcu_barrier() is active at a
> time, since the part after rcu_barrier() is damn cheap and _can_ be
> serialized:
>
>   if (refcount_dec_and_test(&f->refcnt))
>           complete(&f->completion);

Good point, thanks for this kind suggestion.  I will do so in the next
version.


Thanks,
SeongJae Park