[v2,10/13] nfsd: move th_cnt into nfsd_net

Message ID	0fa7bf5b5bbc863180e50363435b5a56c43dc5e3.1706212208.git.josef@toxicpanda.com (mailing list archive)
State	New
Headers	show Received: from mail-yb1-f173.google.com (mail-yb1-f173.google.com [209.85.219.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 93E771350EE for <linux-nfs@vger.kernel.org>; Thu, 25 Jan 2024 19:53:47 +0000 (UTC) From: Josef Bacik <josef@toxicpanda.com> To: linux-nfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net Date: Thu, 25 Jan 2024 14:53:20 -0500 Message-ID: <0fa7bf5b5bbc863180e50363435b5a56c43dc5e3.1706212208.git.josef@toxicpanda.com> In-Reply-To: <cover.1706212207.git.josef@toxicpanda.com> References: <cover.1706212207.git.josef@toxicpanda.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Make nfs and nfsd stats visible in network ns \| expand [v2,00/13] Make nfs and nfsd stats visible in network ns [v2,01/13] sunrpc: don't change ->sv_stats if it doesn't exist [v2,02/13] nfs: stop setting ->pg_stats for unused stats [v2,03/13] sunrpc: pass in the sv_stats struct through svc_create* [v2,04/13] sunrpc: remove ->pg_stats from svc_program [v2,05/13] sunrpc: add a struct rpc_stats arg to rpc_create_args [v2,06/13] sunrpc: use the struct net as the svc proc private [v2,07/13] nfsd: rename NFSD_NET_* to NFSD_STATS_* [v2,08/13] nfsd: expose /proc/net/sunrpc/nfsd in net namespaces [v2,09/13] nfsd: make all of the nfsd stats per-network namespace [v2,10/13] nfsd: move th_cnt into nfsd_net [v2,11/13] nfsd: make svc_stat per-network namespace instead of global [v2,12/13] nfs: expose /proc/net/sunrpc/nfs in net namespaces [v2,13/13] nfs: make the rpc_stat per net namespace

Josef Bacik Jan. 25, 2024, 7:53 p.m. UTC

This is the last global stat, move it into nfsd_net and adjust all the
users to use that variant instead of the global one.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/nfsd/netns.h  | 3 +++
 fs/nfsd/nfssvc.c | 4 ++--
 fs/nfsd/stats.c  | 4 ++--
 fs/nfsd/stats.h  | 6 ------
 4 files changed, 7 insertions(+), 10 deletions(-)

Chuck Lever Jan. 25, 2024, 9:01 p.m. UTC | #1

On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> This is the last global stat, move it into nfsd_net and adjust all the
> users to use that variant instead of the global one.

Hm. I thought nfsd threads were a global resource -- they service
all network namespaces. So, shouldn't the same thread count be
surfaced to all containers? Won't they all see all of the nfsd
processes?


> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/nfsd/netns.h  | 3 +++
>  fs/nfsd/nfssvc.c | 4 ++--
>  fs/nfsd/stats.c  | 4 ++--
>  fs/nfsd/stats.h  | 6 ------
>  4 files changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index 0cef4bb407a9..8d3f4cb7cab4 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -179,6 +179,9 @@ struct nfsd_net {
>  	/* Per-netns stats counters */
>  	struct percpu_counter    counter[NFSD_STATS_COUNTERS_NUM];
>  
> +	/* number of available threads */
> +	atomic_t                 th_cnt;
> +
>  	/* longest hash chain seen */
>  	unsigned int             longest_chain;
>  
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index d98a6abad990..0961b95dcef6 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -924,7 +924,7 @@ nfsd(void *vrqstp)
>  
>  	current->fs->umask = 0;
>  
> -	atomic_inc(&nfsdstats.th_cnt);
> +	atomic_inc(&nn->th_cnt);
>  
>  	set_freezable();
>  
> @@ -940,7 +940,7 @@ nfsd(void *vrqstp)
>  		nfsd_file_net_dispose(nn);
>  	}
>  
> -	atomic_dec(&nfsdstats.th_cnt);
> +	atomic_dec(&nn->th_cnt);
>  
>  out:
>  	/* Release the thread */
> diff --git a/fs/nfsd/stats.c b/fs/nfsd/stats.c
> index 44e275324b06..360e6dbf4e5c 100644
> --- a/fs/nfsd/stats.c
> +++ b/fs/nfsd/stats.c
> @@ -27,7 +27,6 @@
>  
>  #include "nfsd.h"
>  
> -struct nfsd_stats	nfsdstats;
>  struct svc_stat		nfsd_svcstats = {
>  	.program	= &nfsd_program,
>  };
> @@ -47,7 +46,7 @@ static int nfsd_show(struct seq_file *seq, void *v)
>  		   percpu_counter_sum_positive(&nn->counter[NFSD_STATS_IO_WRITE]));
>  
>  	/* thread usage: */
> -	seq_printf(seq, "th %u 0", atomic_read(&nfsdstats.th_cnt));
> +	seq_printf(seq, "th %u 0", atomic_read(&nn->th_cnt));
>  
>  	/* deprecated thread usage histogram stats */
>  	for (i = 0; i < 10; i++)
> @@ -112,6 +111,7 @@ void nfsd_percpu_counters_destroy(struct percpu_counter counters[], int num)
>  
>  int nfsd_stat_counters_init(struct nfsd_net *nn)
>  {
> +	atomic_set(&nn->th_cnt, 0);
>  	return nfsd_percpu_counters_init(nn->counter, NFSD_STATS_COUNTERS_NUM);
>  }
>  
> diff --git a/fs/nfsd/stats.h b/fs/nfsd/stats.h
> index c24be4ddbe7d..5675d283a537 100644
> --- a/fs/nfsd/stats.h
> +++ b/fs/nfsd/stats.h
> @@ -10,12 +10,6 @@
>  #include <uapi/linux/nfsd/stats.h>
>  #include <linux/percpu_counter.h>
>  
> -struct nfsd_stats {
> -	atomic_t	th_cnt;		/* number of available threads */
> -};
> -
> -extern struct nfsd_stats	nfsdstats;
> -
>  extern struct svc_stat		nfsd_svcstats;
>  
>  int nfsd_percpu_counters_init(struct percpu_counter *counters, int num);
> -- 
> 2.43.0
> 
>

Josef Bacik Jan. 25, 2024, 9:56 p.m. UTC | #2

On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > This is the last global stat, move it into nfsd_net and adjust all the
> > users to use that variant instead of the global one.
> 
> Hm. I thought nfsd threads were a global resource -- they service
> all network namespaces. So, shouldn't the same thread count be
> surfaced to all containers? Won't they all see all of the nfsd
> processes?
> 

I don't think we want the network namespaces seeing how many threads exist in
the entire system right?

Additionally it appears that we can have multiple threads per network namespace,
so it's not like this will just show 1 for each individual nn, it'll show
however many threads have been configured for that nfsd in that network
namespace.

I'm good either way, but it makes sense to me to only surface the network
namespace related thread count.  I could probably have a global counter and only
surface the global counter if net == &init_net.  Let me know what you prefer.
Thanks,

Josef

Jeffrey Layton Jan. 26, 2024, 1:01 p.m. UTC | #3

On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > This is the last global stat, move it into nfsd_net and adjust all the
> > > users to use that variant instead of the global one.
> > 
> > Hm. I thought nfsd threads were a global resource -- they service
> > all network namespaces. So, shouldn't the same thread count be
> > surfaced to all containers? Won't they all see all of the nfsd
> > processes?
> > 
> 

Each container is going to start /proc/fs/nfsd/threads number of threads
regardless. I hadn't actually grokked that they just get tossed onto the
pile of threads that service requests.

Is is possible for one container to start a small number of threads but
have its client load be such that it spills over and ends up stealing
threads from other containers?

> I don't think we want the network namespaces seeing how many threads exist in
> the entire system right?
> 
> Additionally it appears that we can have multiple threads per network namespace,
> so it's not like this will just show 1 for each individual nn, it'll show
> however many threads have been configured for that nfsd in that network
> namespace.
> 
> I'm good either way, but it makes sense to me to only surface the network
> namespace related thread count.  I could probably have a global counter and only
> surface the global counter if net == &init_net.  Let me know what you prefer.
> Thanks,
> 

The old stat meant "number of threads that are currently busy". The new
one will mean "number of threads that are busy servicing requests in
this namespace". The latter seems more useful on a per-ns basis.

In fact, it might be interesting to throw out some kind of warning when
the th_cnt exceeds the number of threads that you've allocated in the
container. That might be an indicator that you didn't spin up enough and
are poaching from other containers.

Or... maybe we should be queueing the RPC when that happens so that you
can't use more than you've allocated for the container?

Chuck Lever Jan. 26, 2024, 1:48 p.m. UTC | #4

> On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>> users to use that variant instead of the global one.
>>> 
>>> Hm. I thought nfsd threads were a global resource -- they service
>>> all network namespaces. So, shouldn't the same thread count be
>>> surfaced to all containers? Won't they all see all of the nfsd
>>> processes?
> 
> Each container is going to start /proc/fs/nfsd/threads number of threads
> regardless. I hadn't actually grokked that they just get tossed onto the
> pile of threads that service requests.
> 
> Is is possible for one container to start a small number of threads but
> have its client load be such that it spills over and ends up stealing
> threads from other containers?

I haven't seen any code that manages resources based on namespace,
except in filecache.c to restrict writeback per namespace.

My impression is that any nfsd thread can serve any namespace. I'm
not sure it is currently meaningful for a particular net namespace to
"create" more threads.

If someone would like that level of control, we could implement a
cgroup mechanism and have one or more separate svc_pools per net
namespace, maybe? </hand wave>

>> I don't think we want the network namespaces seeing how many threads exist in
>> the entire system right?

If someone in a non-init net namespace does a "pgrep -c nfsd" don't
they see the total nfsd thread count for the host?

>> Additionally it appears that we can have multiple threads per network namespace,
>> so it's not like this will just show 1 for each individual nn, it'll show
>> however many threads have been configured for that nfsd in that network
>> namespace.

I've never tried this, so I'm speculating. But it seems like for
now, because all nfsd threads can serve all namespaces, they should
all see the global thread count stat.

Then later we can refine it.

>> I'm good either way, but it makes sense to me to only surface the network
>> namespace related thread count.  I could probably have a global counter and only
>> surface the global counter if net == &init_net.  Let me know what you prefer.

--
Chuck Lever

Jeffrey Layton Jan. 26, 2024, 2:08 p.m. UTC | #5

On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> 
> > On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > users to use that variant instead of the global one.
> > > > 
> > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > all network namespaces. So, shouldn't the same thread count be
> > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > processes?
> > 
> > Each container is going to start /proc/fs/nfsd/threads number of threads
> > regardless. I hadn't actually grokked that they just get tossed onto the
> > pile of threads that service requests.
> > 
> > Is is possible for one container to start a small number of threads but
> > have its client load be such that it spills over and ends up stealing
> > threads from other containers?
> 
> I haven't seen any code that manages resources based on namespace,
> except in filecache.c to restrict writeback per namespace.
> 
> My impression is that any nfsd thread can serve any namespace. I'm
> not sure it is currently meaningful for a particular net namespace to
> "create" more threads.
> 
> If someone would like that level of control, we could implement a
> cgroup mechanism and have one or more separate svc_pools per net
> namespace, maybe? </hand wave>
> 

AFAICT, the total number of threads on the system will be the sum of the
threads started in each of the containers. They do just go into a big
pile, and whichever one wakes up will service the request, so the
threads aren't associated with the netns, per-se. The svc_rqst's however
_are_ per-netns. So, I don't see anything that ensures that a container
doesn't exceed the number of threads it started on its own behalf.

<hand wave>
I'm not sure we'd need to tie this in to cgroups. Now that Josef is
moving some of these key structures to be per-net, it should be fairly
simple to have nfsd() just look at the th_cnt and the thread count in
the current namespace, and just enqueue the RPC rather than doing it?
</hand wave>

OTOH, maybe I'm overly concerned here.

> 
> > > I don't think we want the network namespaces seeing how many threads exist in
> > > the entire system right?
> 
> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> they see the total nfsd thread count for the host?
> 

Yes, they're kernel threads and they aren't associated with a particular
pid namespace.

> 
> > > Additionally it appears that we can have multiple threads per network namespace,
> > > so it's not like this will just show 1 for each individual nn, it'll show
> > > however many threads have been configured for that nfsd in that network
> > > namespace.
> 
> I've never tried this, so I'm speculating. But it seems like for
> now, because all nfsd threads can serve all namespaces, they should
> all see the global thread count stat.
> 
> Then later we can refine it.
> 

I don't think that info is particularly useful though, and it certainly
breaks expectations wrt container isolation.

Look at it this way:

Suppose I have access to a container and I spin up nfsd with a
particular number of threads. I now want to know "did I spin up enough
threads?" By making this per-namespace as Josef suggests it should be
fairly simple to tell whether my clients are regularly overrunning the
threads I started. With this info as global, I have no idea what netns
the RPCs being counted are against. I can't do anything with that info.

> 
> > > I'm good either way, but it makes sense to me to only surface the network
> > > namespace related thread count.  I could probably have a global counter and only
> > > surface the global counter if net == &init_net.  Let me know what you prefer.
>

Chuck Lever Jan. 26, 2024, 2:27 p.m. UTC | #6

> On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>> 
>>> On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>> 
>>> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>>>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>>>> users to use that variant instead of the global one.
>>>>> 
>>>>> Hm. I thought nfsd threads were a global resource -- they service
>>>>> all network namespaces. So, shouldn't the same thread count be
>>>>> surfaced to all containers? Won't they all see all of the nfsd
>>>>> processes?
>>> 
>>> Each container is going to start /proc/fs/nfsd/threads number of threads
>>> regardless. I hadn't actually grokked that they just get tossed onto the
>>> pile of threads that service requests.
>>> 
>>> Is is possible for one container to start a small number of threads but
>>> have its client load be such that it spills over and ends up stealing
>>> threads from other containers?
>> 
>> I haven't seen any code that manages resources based on namespace,
>> except in filecache.c to restrict writeback per namespace.
>> 
>> My impression is that any nfsd thread can serve any namespace. I'm
>> not sure it is currently meaningful for a particular net namespace to
>> "create" more threads.
>> 
>> If someone would like that level of control, we could implement a
>> cgroup mechanism and have one or more separate svc_pools per net
>> namespace, maybe? </hand wave>
>> 
> 
> AFAICT, the total number of threads on the system will be the sum of the
> threads started in each of the containers. They do just go into a big
> pile, and whichever one wakes up will service the request, so the
> threads aren't associated with the netns, per-se. The svc_rqst's however
> _are_ per-netns. So, I don't see anything that ensures that a container
> doesn't exceed the number of threads it started on its own behalf.
> 
> <hand wave>
> I'm not sure we'd need to tie this in to cgroups.

Not a need, but cgroups are typically the way that such
resource constraints are managed, that's all.


> Now that Josef is
> moving some of these key structures to be per-net, it should be fairly
> simple to have nfsd() just look at the th_cnt and the thread count in
> the current namespace, and just enqueue the RPC rather than doing it?
> </hand wave>
> 
> OTOH, maybe I'm overly concerned here.
> 
> 
>> 
>>>> I don't think we want the network namespaces seeing how many threads exist in
>>>> the entire system right?
>> 
>> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
>> they see the total nfsd thread count for the host?
>> 
> 
> Yes, they're kernel threads and they aren't associated with a particular
> pid namespace.
> 
>> 
>>>> Additionally it appears that we can have multiple threads per network namespace,
>>>> so it's not like this will just show 1 for each individual nn, it'll show
>>>> however many threads have been configured for that nfsd in that network
>>>> namespace.
>> 
>> I've never tried this, so I'm speculating. But it seems like for
>> now, because all nfsd threads can serve all namespaces, they should
>> all see the global thread count stat.
>> 
>> Then later we can refine it.
>> 
> 
> I don't think that info is particularly useful though, and it certainly
> breaks expectations wrt container isolation.
> 
> Look at it this way:
> 
> Suppose I have access to a container and I spin up nfsd with a
> particular number of threads. I now want to know "did I spin up enough
> threads?"

It makes sense to me for a container to ask for one or more
NFSD listeners. But I'm not yet clear on what it means for
a container to try to alter the NFSD thread count, which is
currently global.


> By making this per-namespace as Josef suggests it should be
> fairly simple to tell whether my clients are regularly overrunning the
> threads I started. With this info as global, I have no idea what netns
> the RPCs being counted are against. I can't do anything with that info.

That's true. I'm just not sure we need to do anything
significant about it as part of this patch series.

I'm not at all advocating leaving it like this. I agree it
needs updating.

For now, just fill in that thread count stat with the global
thread count, and later when we have a better concept for
what it means to "adjust the nfsd thread count from inside a
container" we can come back and make it make sense.


--
Chuck Lever

Jeffrey Layton Jan. 26, 2024, 3:03 p.m. UTC | #7

On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
> 
> > On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > > 
> > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > users to use that variant instead of the global one.
> > > > > > 
> > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > processes?
> > > > 
> > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > pile of threads that service requests.
> > > > 
> > > > Is is possible for one container to start a small number of threads but
> > > > have its client load be such that it spills over and ends up stealing
> > > > threads from other containers?
> > > 
> > > I haven't seen any code that manages resources based on namespace,
> > > except in filecache.c to restrict writeback per namespace.
> > > 
> > > My impression is that any nfsd thread can serve any namespace. I'm
> > > not sure it is currently meaningful for a particular net namespace to
> > > "create" more threads.
> > > 
> > > If someone would like that level of control, we could implement a
> > > cgroup mechanism and have one or more separate svc_pools per net
> > > namespace, maybe? </hand wave>
> > > 
> > 
> > AFAICT, the total number of threads on the system will be the sum of the
> > threads started in each of the containers. They do just go into a big
> > pile, and whichever one wakes up will service the request, so the
> > threads aren't associated with the netns, per-se. The svc_rqst's however
> > _are_ per-netns. So, I don't see anything that ensures that a container
> > doesn't exceed the number of threads it started on its own behalf.
> > 
> > <hand wave>
> > I'm not sure we'd need to tie this in to cgroups.
> 
> Not a need, but cgroups are typically the way that such
> resource constraints are managed, that's all.
> 
> 
> > Now that Josef is
> > moving some of these key structures to be per-net, it should be fairly
> > simple to have nfsd() just look at the th_cnt and the thread count in
> > the current namespace, and just enqueue the RPC rather than doing it?
> > </hand wave>
> > 
> > OTOH, maybe I'm overly concerned here.
> > 
> > 
> > > 
> > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > the entire system right?
> > > 
> > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > they see the total nfsd thread count for the host?
> > > 
> > 
> > Yes, they're kernel threads and they aren't associated with a particular
> > pid namespace.
> > 
> > > 
> > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > however many threads have been configured for that nfsd in that network
> > > > > namespace.
> > > 
> > > I've never tried this, so I'm speculating. But it seems like for
> > > now, because all nfsd threads can serve all namespaces, they should
> > > all see the global thread count stat.
> > > 
> > > Then later we can refine it.
> > > 
> > 
> > I don't think that info is particularly useful though, and it certainly
> > breaks expectations wrt container isolation.
> > 
> > Look at it this way:
> > 
> > Suppose I have access to a container and I spin up nfsd with a
> > particular number of threads. I now want to know "did I spin up enough
> > threads?"
> 
> It makes sense to me for a container to ask for one or more
> NFSD listeners. But I'm not yet clear on what it means for
> a container to try to alter the NFSD thread count, which is
> currently global.
> 

write_threads sets the number of nfsd threads for the svc_serv, which is
a per-net object, so serv->sv_nrthreads is only counting the threads for
its namespace.

Each container that starts an nfsd will contribute "threads" number of
nfsds to the pile. When you set "threads" to a different number, the
total pile is grown or reduced accordingly. In fact, I'm not even sure
we keep a systemwide total.

What _isn't_ done (unless I'm missing something) is any sort of
limitation on the use of those threads. So you could have a container
that starts one nfsd thread, but its clients can still have many more
RPCs in flight, up to the total number of nfsd's running on the host.
Limiting that might be possible, but I'm not yet convinced that it's a
good idea.

It might be interesting to gather some metrics though. How often do the
number of RPCs in flight exceed the number of threads that we started in
a namespace? Might also be good to gather a high-water mark too?

> 
> > By making this per-namespace as Josef suggests it should be
> > fairly simple to tell whether my clients are regularly overrunning the
> > threads I started. With this info as global, I have no idea what netns
> > the RPCs being counted are against. I can't do anything with that info.
> 
> That's true. I'm just not sure we need to do anything
> significant about it as part of this patch series.
> 
> I'm not at all advocating leaving it like this. I agree it
> needs updating.
> 
> For now, just fill in that thread count stat with the global
> thread count, and later when we have a better concept for
> what it means to "adjust the nfsd thread count from inside a
> container" we can come back and make it make sense.

It would be good to step back and decide how we think this should work.
What would this look like if we were designing it today? Then we can
decide how to get there.

Personally, I think keeping the view inside the container as isolated as
possible is the right thing to do.

Chuck Lever Jan. 26, 2024, 3:16 p.m. UTC | #8

> On Jan 26, 2024, at 10:03 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
>> 
>>> On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>> 
>>> On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
>>>>> 
>>>>> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>>>>>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>>>>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>>>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>>>>>> users to use that variant instead of the global one.
>>>>>>> 
>>>>>>> Hm. I thought nfsd threads were a global resource -- they service
>>>>>>> all network namespaces. So, shouldn't the same thread count be
>>>>>>> surfaced to all containers? Won't they all see all of the nfsd
>>>>>>> processes?
>>>>> 
>>>>> Each container is going to start /proc/fs/nfsd/threads number of threads
>>>>> regardless. I hadn't actually grokked that they just get tossed onto the
>>>>> pile of threads that service requests.
>>>>> 
>>>>> Is is possible for one container to start a small number of threads but
>>>>> have its client load be such that it spills over and ends up stealing
>>>>> threads from other containers?
>>>> 
>>>> I haven't seen any code that manages resources based on namespace,
>>>> except in filecache.c to restrict writeback per namespace.
>>>> 
>>>> My impression is that any nfsd thread can serve any namespace. I'm
>>>> not sure it is currently meaningful for a particular net namespace to
>>>> "create" more threads.
>>>> 
>>>> If someone would like that level of control, we could implement a
>>>> cgroup mechanism and have one or more separate svc_pools per net
>>>> namespace, maybe? </hand wave>
>>>> 
>>> 
>>> AFAICT, the total number of threads on the system will be the sum of the
>>> threads started in each of the containers. They do just go into a big
>>> pile, and whichever one wakes up will service the request, so the
>>> threads aren't associated with the netns, per-se. The svc_rqst's however
>>> _are_ per-netns. So, I don't see anything that ensures that a container
>>> doesn't exceed the number of threads it started on its own behalf.
>>> 
>>> <hand wave>
>>> I'm not sure we'd need to tie this in to cgroups.
>> 
>> Not a need, but cgroups are typically the way that such
>> resource constraints are managed, that's all.
>> 
>> 
>>> Now that Josef is
>>> moving some of these key structures to be per-net, it should be fairly
>>> simple to have nfsd() just look at the th_cnt and the thread count in
>>> the current namespace, and just enqueue the RPC rather than doing it?
>>> </hand wave>
>>> 
>>> OTOH, maybe I'm overly concerned here.
>>> 
>>> 
>>>> 
>>>>>> I don't think we want the network namespaces seeing how many threads exist in
>>>>>> the entire system right?
>>>> 
>>>> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
>>>> they see the total nfsd thread count for the host?
>>>> 
>>> 
>>> Yes, they're kernel threads and they aren't associated with a particular
>>> pid namespace.
>>> 
>>>> 
>>>>>> Additionally it appears that we can have multiple threads per network namespace,
>>>>>> so it's not like this will just show 1 for each individual nn, it'll show
>>>>>> however many threads have been configured for that nfsd in that network
>>>>>> namespace.
>>>> 
>>>> I've never tried this, so I'm speculating. But it seems like for
>>>> now, because all nfsd threads can serve all namespaces, they should
>>>> all see the global thread count stat.
>>>> 
>>>> Then later we can refine it.
>>>> 
>>> 
>>> I don't think that info is particularly useful though, and it certainly
>>> breaks expectations wrt container isolation.
>>> 
>>> Look at it this way:
>>> 
>>> Suppose I have access to a container and I spin up nfsd with a
>>> particular number of threads. I now want to know "did I spin up enough
>>> threads?"
>> 
>> It makes sense to me for a container to ask for one or more
>> NFSD listeners. But I'm not yet clear on what it means for
>> a container to try to alter the NFSD thread count, which is
>> currently global.
>> 
> 
> write_threads sets the number of nfsd threads for the svc_serv, which is
> a per-net object, so serv->sv_nrthreads is only counting the threads for
> its namespace.

OK. I missed the part where struct svc_serv is per-net, and
each svc_serv has its own thread pool. Got it.


> Each container that starts an nfsd will contribute "threads" number of
> nfsds to the pile. When you set "threads" to a different number, the
> total pile is grown or reduced accordingly. In fact, I'm not even sure
> we keep a systemwide total.
> 
> What _isn't_ done (unless I'm missing something) is any sort of
> limitation on the use of those threads. So you could have a container
> that starts one nfsd thread, but its clients can still have many more
> RPCs in flight, up to the total number of nfsd's running on the host.
> Limiting that might be possible, but I'm not yet convinced that it's a
> good idea.
> 
> It might be interesting to gather some metrics though. How often do the
> number of RPCs in flight exceed the number of threads that we started in
> a namespace? Might also be good to gather a high-water mark too?

I had a bunch of trace points queued up for this purpose,
and all that got thrown out by Neil's recent work fixing
up our pool thread scheduler.

I'm not sure that queued requests are all that meaningful
or even quantifiable. A more meaningful metric is round
trip time measured on clients.


>>> By making this per-namespace as Josef suggests it should be
>>> fairly simple to tell whether my clients are regularly overrunning the
>>> threads I started. With this info as global, I have no idea what netns
>>> the RPCs being counted are against. I can't do anything with that info.
>> 
>> That's true. I'm just not sure we need to do anything
>> significant about it as part of this patch series.
>> 
>> I'm not at all advocating leaving it like this. I agree it
>> needs updating.
>> 
>> For now, just fill in that thread count stat with the global
>> thread count, and later when we have a better concept for
>> what it means to "adjust the nfsd thread count from inside a
>> container" we can come back and make it make sense.
> 
> It would be good to step back and decide how we think this should work.
> What would this look like if we were designing it today? Then we can
> decide how to get there.
> 
> Personally, I think keeping the view inside the container as isolated as
> possible is the right thing to do.

I agree. I would like to get Josef's patches in quickly too.

Should that metric report svc_serv::sv_nrthreads instead of
the global thread count?


--
Chuck Lever

Jeffrey Layton Jan. 26, 2024, 3:35 p.m. UTC | #9

On Fri, 2024-01-26 at 15:16 +0000, Chuck Lever III wrote:
> 
> > On Jan 26, 2024, at 10:03 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
> > > 
> > > > On Jan 26, 2024, at 9:08 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > > > > > 
> > > > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > > > users to use that variant instead of the global one.
> > > > > > > > 
> > > > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > > > processes?
> > > > > > 
> > > > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > > > pile of threads that service requests.
> > > > > > 
> > > > > > Is is possible for one container to start a small number of threads but
> > > > > > have its client load be such that it spills over and ends up stealing
> > > > > > threads from other containers?
> > > > > 
> > > > > I haven't seen any code that manages resources based on namespace,
> > > > > except in filecache.c to restrict writeback per namespace.
> > > > > 
> > > > > My impression is that any nfsd thread can serve any namespace. I'm
> > > > > not sure it is currently meaningful for a particular net namespace to
> > > > > "create" more threads.
> > > > > 
> > > > > If someone would like that level of control, we could implement a
> > > > > cgroup mechanism and have one or more separate svc_pools per net
> > > > > namespace, maybe? </hand wave>
> > > > > 
> > > > 
> > > > AFAICT, the total number of threads on the system will be the sum of the
> > > > threads started in each of the containers. They do just go into a big
> > > > pile, and whichever one wakes up will service the request, so the
> > > > threads aren't associated with the netns, per-se. The svc_rqst's however
> > > > _are_ per-netns. So, I don't see anything that ensures that a container
> > > > doesn't exceed the number of threads it started on its own behalf.
> > > > 
> > > > <hand wave>
> > > > I'm not sure we'd need to tie this in to cgroups.
> > > 
> > > Not a need, but cgroups are typically the way that such
> > > resource constraints are managed, that's all.
> > > 
> > > 
> > > > Now that Josef is
> > > > moving some of these key structures to be per-net, it should be fairly
> > > > simple to have nfsd() just look at the th_cnt and the thread count in
> > > > the current namespace, and just enqueue the RPC rather than doing it?
> > > > </hand wave>
> > > > 
> > > > OTOH, maybe I'm overly concerned here.
> > > > 
> > > > 
> > > > > 
> > > > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > > > the entire system right?
> > > > > 
> > > > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > > > they see the total nfsd thread count for the host?
> > > > > 
> > > > 
> > > > Yes, they're kernel threads and they aren't associated with a particular
> > > > pid namespace.
> > > > 
> > > > > 
> > > > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > > > however many threads have been configured for that nfsd in that network
> > > > > > > namespace.
> > > > > 
> > > > > I've never tried this, so I'm speculating. But it seems like for
> > > > > now, because all nfsd threads can serve all namespaces, they should
> > > > > all see the global thread count stat.
> > > > > 
> > > > > Then later we can refine it.
> > > > > 
> > > > 
> > > > I don't think that info is particularly useful though, and it certainly
> > > > breaks expectations wrt container isolation.
> > > > 
> > > > Look at it this way:
> > > > 
> > > > Suppose I have access to a container and I spin up nfsd with a
> > > > particular number of threads. I now want to know "did I spin up enough
> > > > threads?"
> > > 
> > > It makes sense to me for a container to ask for one or more
> > > NFSD listeners. But I'm not yet clear on what it means for
> > > a container to try to alter the NFSD thread count, which is
> > > currently global.
> > > 
> > 
> > write_threads sets the number of nfsd threads for the svc_serv, which is
> > a per-net object, so serv->sv_nrthreads is only counting the threads for
> > its namespace.
> 
> OK. I missed the part where struct svc_serv is per-net, and
> each svc_serv has its own thread pool. Got it.
> 
> 
> > Each container that starts an nfsd will contribute "threads" number of
> > nfsds to the pile. When you set "threads" to a different number, the
> > total pile is grown or reduced accordingly. In fact, I'm not even sure
> > we keep a systemwide total.
> > 
> > What _isn't_ done (unless I'm missing something) is any sort of
> > limitation on the use of those threads. So you could have a container
> > that starts one nfsd thread, but its clients can still have many more
> > RPCs in flight, up to the total number of nfsd's running on the host.
> > Limiting that might be possible, but I'm not yet convinced that it's a
> > good idea.
> > 
> > It might be interesting to gather some metrics though. How often do the
> > number of RPCs in flight exceed the number of threads that we started in
> > a namespace? Might also be good to gather a high-water mark too?
> 
> I had a bunch of trace points queued up for this purpose,
> and all that got thrown out by Neil's recent work fixing
> up our pool thread scheduler.
> 
> I'm not sure that queued requests are all that meaningful
> or even quantifiable. A more meaningful metric is round
> trip time measured on clients.
> 

I wouldn't try to queue anything just yet, but knowing when a particular
container is exceeding the number of threads it started seems like a
helpful metric. That could be done in a later patch though. I wouldn't
block this work on that.
 
> 
> > > > By making this per-namespace as Josef suggests it should be
> > > > fairly simple to tell whether my clients are regularly overrunning the
> > > > threads I started. With this info as global, I have no idea what netns
> > > > the RPCs being counted are against. I can't do anything with that info.
> > > 
> > > That's true. I'm just not sure we need to do anything
> > > significant about it as part of this patch series.
> > > 
> > > I'm not at all advocating leaving it like this. I agree it
> > > needs updating.
> > > 
> > > For now, just fill in that thread count stat with the global
> > > thread count, and later when we have a better concept for
> > > what it means to "adjust the nfsd thread count from inside a
> > > container" we can come back and make it make sense.
> > 
> > It would be good to step back and decide how we think this should work.
> > What would this look like if we were designing it today? Then we can
> > decide how to get there.
> > 
> > Personally, I think keeping the view inside the container as isolated as
> > possible is the right thing to do.
> 
> I agree. I would like to get Josef's patches in quickly too.
> 
> Should that metric report svc_serv::sv_nrthreads instead of
> the global thread count?
> 

No. The sv_nrthreads is just the same value that /proc/fs/nfsd/threads
shows. The _metric_ should show the th_cnt for that particular
namespace.

It would also be nice to know what the high-water mark for th_cnt is.
IOW, if my containers clients exceeded the number of threads I've
started, then it would be helpful to know by how much.

[v2,10/13] nfsd: move th_cnt into nfsd_net

Commit Message

Comments

Patch