mbox series

[00/10] exposing knfsd opens to userspace

Message ID 1556201060-7947-1-git-send-email-bfields@redhat.com
Headers show
Series exposing knfsd opens to userspace | expand

Message

J. Bruce Fields April 25, 2019, 2:04 p.m. UTC
From: "J. Bruce Fields" <bfields@redhat.com>

The following patches expose information about NFSv4 opens held by knfsd
on behalf of NFSv4 clients.  Those are currently invisible to userspace,
unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).

The approach is to add a new directory /proc/fs/nfsd/clients/ with
subdirectories for each active NFSv4 client.  Each subdirectory has an
"info" file with some basic information to help identify the client and
an "opens" directory that lists the opens held by that client.

I got it working by cobbling together some poorly-understood code I
found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
and tell me what I've got wrong, they're more than welcome, but at this
stage I'm more curious for feedback on the interface.

I'm also cc'ing people responsible for lsof and util-linux in case they
have any opinions.

Currently these pseudofiles look like:

  # find /proc/fs/nfsd/clients -type f|xargs tail 
  ==> /proc/fs/nfsd/clients/3741/opens <==
  5cc0cd36/6debfb50/00000001/00000001	rw	--	fd:10:13649	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
  5cc0cd36/6debfb50/00000003/00000001	r-	--	fd:10:13650	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'

  ==> /proc/fs/nfsd/clients/3741/info <==
  clientid: 6debfb505cc0cd36
  address: 192.168.122.36:0
  name: Linux NFSv4.2 test2.fieldses.org
  minor version: 2

Each line of the "opens" file is tab-delimited and describes one open,
and the fields are stateid, open access bits, deny bits,
major:minor:ino, and open owner.

So, some random questions:

	- I just copied the major:minor:ino thing from /proc/locks, I
	  suspect we would have picked something different to identify
	  inodes if /proc/locks were done now.  (Mount id and inode?
	  Something else?)

	- The open owner is just an opaque blob of binary data, but
	  clients may choose to include some useful asci-encoded
	  information, so I'm formatting them as strings with non-ascii
	  stuff escaped.  For example, pynfs usually uses the name of
	  the test as the open owner.  But as you see above, the ascii
	  content of the Linux client's open owners is less useful.
	  Also, there's no way I know of to map them back to a file
	  description or process or anything else useful on the client,
	  so perhaps they're of limited interest.

	- I'm not sure about the stateid either.  I did think it might
	  be useful just as a unique identifier for each line.
	  (Actually for that it'd be enough to take just the third of
	  those four numbers making up the stateid--maybe that would be
	  better.)

In the "info" file, the "name" line is the client identifier/client
owner provided by the client, which (like the stateowner) is just opaque
binary data, though as you can see here the Linux client is providing a
readable ascii string.

There's probably a lot more we could add to that info file eventually.

Other stuff to add next:

	- nfsd/clients/#/kill that you can write to to revoke all a
	  client's state if it's wedged somehow.
	- lists of locks and delegations; lower priority since most of
	  that information is already in /proc/locks.

--b.

J. Bruce Fields (10):
  nfsd: persist nfsd filesystem across mounts
  nfsd: rename cl_refcount
  nfsd4: use reference count to free client
  nfsd: add nfsd/clients directory
  nfsd: make client/ directory names small ints
  rpc: replace rpc_filelist by tree_descr
  nfsd4: add a client info file
  nfsd4: add file to display list of client's opens
  nfsd: expose some more information about NFSv4 opens
  nfsd: add more information to client info file

 fs/nfsd/netns.h                |   6 +
 fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
 fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
 fs/nfsd/nfsd.h                 |  11 ++
 fs/nfsd/state.h                |   9 +-
 fs/seq_file.c                  |  17 +++
 include/linux/seq_file.h       |   2 +
 include/linux/string_helpers.h |   1 +
 lib/string_helpers.c           |   5 +-
 net/sunrpc/rpc_pipe.c          |  37 ++----
 10 files changed, 491 insertions(+), 50 deletions(-)

Comments

Jeff Layton April 25, 2019, 5:02 p.m. UTC | #1
On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
> 
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
> 
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.
> 
> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
> 
> Currently these pseudofiles look like:
> 
>   # find /proc/fs/nfsd/clients -type f|xargs tail 
>   ==> /proc/fs/nfsd/clients/3741/opens <==
>   5cc0cd36/6debfb50/00000001/00000001	rw	--	fd:10:13649	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>   5cc0cd36/6debfb50/00000003/00000001	r-	--	fd:10:13650	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 
>   ==> /proc/fs/nfsd/clients/3741/info <==
>   clientid: 6debfb505cc0cd36
>   address: 192.168.122.36:0
>   name: Linux NFSv4.2 test2.fieldses.org
>   minor version: 2
> 
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
> 

Nice work! We've needed this for a long time.

One thing we need to consider here from the get-go though is what sort
of ABI guarantee you want for this format. People _will_ write scripts
that scrape this info, so we should take that into account up front.


> So, some random questions:
> 
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
> 

That does make it easy to correlate with the info in /proc/locks.

We'd have a dentry here by virtue of the nfs4_file. Should we print a
path in addition to this?

> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful asci-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
> 
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
> 

It'd be ideal to be able to easily correlate this info with what
wireshark displays. Does wireshark display hashes for openowners? I know
it does for stateids. If so, generating the same hash would be really
nice.

That said, waybe it's best to just dump the raw info out here though and
rely on some postprocessing scripts for viewing it?

> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
> 
> There's probably a lot more we could add to that info file eventually.
> 
> Other stuff to add next:
> 
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.

That would also be neat. We have a bit of code to support today that in
the fault injection code, but it'll need some cleanup and wiring it into
a knob here would be better.

> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.

Agreed.

> J. Bruce Fields (10):
>   nfsd: persist nfsd filesystem across mounts
>   nfsd: rename cl_refcount
>   nfsd4: use reference count to free client
>   nfsd: add nfsd/clients directory
>   nfsd: make client/ directory names small ints
>   rpc: replace rpc_filelist by tree_descr
>   nfsd4: add a client info file
>   nfsd4: add file to display list of client's opens
>   nfsd: expose some more information about NFSv4 opens
>   nfsd: add more information to client info file
> 
>  fs/nfsd/netns.h                |   6 +
>  fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
>  fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
>  fs/nfsd/nfsd.h                 |  11 ++
>  fs/nfsd/state.h                |   9 +-
>  fs/seq_file.c                  |  17 +++
>  include/linux/seq_file.h       |   2 +
>  include/linux/string_helpers.h |   1 +
>  lib/string_helpers.c           |   5 +-
>  net/sunrpc/rpc_pipe.c          |  37 ++----
>  10 files changed, 491 insertions(+), 50 deletions(-)
>
Jeff Layton April 25, 2019, 6:17 p.m. UTC | #2
On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
> 
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
> 
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.
> 
> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
> 
> Currently these pseudofiles look like:
> 
>   # find /proc/fs/nfsd/clients -type f|xargs tail 
>   ==> /proc/fs/nfsd/clients/3741/opens <==
>   5cc0cd36/6debfb50/00000001/00000001	rw	--	fd:10:13649	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>   5cc0cd36/6debfb50/00000003/00000001	r-	--	fd:10:13650	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 
>   ==> /proc/fs/nfsd/clients/3741/info <==
>   clientid: 6debfb505cc0cd36
>   address: 192.168.122.36:0
>   name: Linux NFSv4.2 test2.fieldses.org
>   minor version: 2
> 
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
> 
> So, some random questions:
> 
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
> 
> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful asci-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
> 
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
> 
> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
> 
> There's probably a lot more we could add to that info file eventually.
> 
> Other stuff to add next:
> 
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.
> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.
> 
> --b.
> 
> J. Bruce Fields (10):
>   nfsd: persist nfsd filesystem across mounts
>   nfsd: rename cl_refcount
>   nfsd4: use reference count to free client
>   nfsd: add nfsd/clients directory
>   nfsd: make client/ directory names small ints
>   rpc: replace rpc_filelist by tree_descr
>   nfsd4: add a client info file
>   nfsd4: add file to display list of client's opens
>   nfsd: expose some more information about NFSv4 opens
>   nfsd: add more information to client info file
> 
>  fs/nfsd/netns.h                |   6 +
>  fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
>  fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
>  fs/nfsd/nfsd.h                 |  11 ++
>  fs/nfsd/state.h                |   9 +-
>  fs/seq_file.c                  |  17 +++
>  include/linux/seq_file.h       |   2 +
>  include/linux/string_helpers.h |   1 +
>  lib/string_helpers.c           |   5 +-
>  net/sunrpc/rpc_pipe.c          |  37 ++----
>  10 files changed, 491 insertions(+), 50 deletions(-)
> 

I went through the patches and the code all looks fine to me. The only
real questions I think are what sort of info we want to present here,
and how we'll deal with changes to the format in the future.

Maybe on that latter point, we can reserve the right to add fields to
this info later, but do our best to never remove existing ones? Then
tools could be written to just ignore the fields that they don't
understand.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
J. Bruce Fields April 25, 2019, 8:01 p.m. UTC | #3
On Thu, Apr 25, 2019 at 01:02:30PM -0400, Jeff Layton wrote:
> On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <bfields@redhat.com>
> > 
> > The following patches expose information about NFSv4 opens held by knfsd
> > on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> > unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
> > 
> > The approach is to add a new directory /proc/fs/nfsd/clients/ with
> > subdirectories for each active NFSv4 client.  Each subdirectory has an
> > "info" file with some basic information to help identify the client and
> > an "opens" directory that lists the opens held by that client.
> > 
> > I got it working by cobbling together some poorly-understood code I
> > found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> > and tell me what I've got wrong, they're more than welcome, but at this
> > stage I'm more curious for feedback on the interface.
> > 
> > I'm also cc'ing people responsible for lsof and util-linux in case they
> > have any opinions.
> > 
> > Currently these pseudofiles look like:
> > 
> >   # find /proc/fs/nfsd/clients -type f|xargs tail 
> >   ==> /proc/fs/nfsd/clients/3741/opens <==
> >   5cc0cd36/6debfb50/00000001/00000001	rw	--	fd:10:13649	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> >   5cc0cd36/6debfb50/00000003/00000001	r-	--	fd:10:13650	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> > 
> >   ==> /proc/fs/nfsd/clients/3741/info <==
> >   clientid: 6debfb505cc0cd36
> >   address: 192.168.122.36:0
> >   name: Linux NFSv4.2 test2.fieldses.org
> >   minor version: 2
> > 
> > Each line of the "opens" file is tab-delimited and describes one open,
> > and the fields are stateid, open access bits, deny bits,
> > major:minor:ino, and open owner.
> > 
> 
> Nice work! We've needed this for a long time.
> 
> One thing we need to consider here from the get-go though is what sort
> of ABI guarantee you want for this format. People _will_ write scripts
> that scrape this info, so we should take that into account up front.

There is a man page for the nfsd filesystem, nfsd(7).  I should write up
something to add to that.  If people write code without reading that
then we may still end up boxed in, of course, but it's a start.

What I'm hoping we can count on from readers:

	- they will ignore any unkown files in clients/#/.
	- readers will ignore any lines in clients/#/info starting with
	  an unrecognized keyword.
	- they will ignore any unknown data at the end of
	  clients/#/opens.

That's in approximate decreasing order of my confidence in those rules
being observed, though I don't think any of those are too much to ask.

> > So, some random questions:
> > 
> > 	- I just copied the major:minor:ino thing from /proc/locks, I
> > 	  suspect we would have picked something different to identify
> > 	  inodes if /proc/locks were done now.  (Mount id and inode?
> > 	  Something else?)
> > 
> 
> That does make it easy to correlate with the info in /proc/locks.
> 
> We'd have a dentry here by virtue of the nfs4_file. Should we print a
> path in addition to this?

We could.  It won't be 100% reliable, of course (unlinks, renames), but
it could still be convenient for human readers, and an optimization for
non-human readers trying to find an inode.

The filehandle might be a good idea too.

I wonder if there's any issue with line length, or with quantity of data
emitted by a single seq_file show method.  The open owner can be up to
4K (after escaping), paths and filehandles can be long too.

> > 	- The open owner is just an opaque blob of binary data, but
> > 	  clients may choose to include some useful asci-encoded
> > 	  information, so I'm formatting them as strings with non-ascii
> > 	  stuff escaped.  For example, pynfs usually uses the name of
> > 	  the test as the open owner.  But as you see above, the ascii
> > 	  content of the Linux client's open owners is less useful.
> > 	  Also, there's no way I know of to map them back to a file
> > 	  description or process or anything else useful on the client,
> > 	  so perhaps they're of limited interest.
> > 
> > 	- I'm not sure about the stateid either.  I did think it might
> > 	  be useful just as a unique identifier for each line.
> > 	  (Actually for that it'd be enough to take just the third of
> > 	  those four numbers making up the stateid--maybe that would be
> > 	  better.)
> 
> It'd be ideal to be able to easily correlate this info with what
> wireshark displays. Does wireshark display hashes for openowners? I know
> it does for stateids. If so, generating the same hash would be really
> nice.
> 
> That said, waybe it's best to just dump the raw info out here though and
> rely on some postprocessing scripts for viewing it?

In that case, I think so, as I don't know how committed wireshark is to
the choice of hash.

> > In the "info" file, the "name" line is the client identifier/client
> > owner provided by the client, which (like the stateowner) is just opaque
> > binary data, though as you can see here the Linux client is providing a
> > readable ascii string.
> > 
> > There's probably a lot more we could add to that info file eventually.
> > 
> > Other stuff to add next:
> > 
> > 	- nfsd/clients/#/kill that you can write to to revoke all a
> > 	  client's state if it's wedged somehow.
> 
> That would also be neat. We have a bit of code to support today that in
> the fault injection code, but it'll need some cleanup and wiring it into
> a knob here would be better.

OK, good, I'm working on that.  Looks like fault injection gives up if
there are rpc's in process for the given client, whereas here I'd rather
force the expiry.  Looks like that needs some straightforward waitqueue
logic to wait for the in-progress rpc's.

--b.
Andreas Dilger April 25, 2019, 9:08 p.m. UTC | #4
On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
> 
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
> 
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
> 
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.

Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
My understanding is that "complex" files are verboten in procfs and sysfs?
We've been going through a lengthy process to move files out of procfs
into sysfs and debugfs as a result (while trying to maintain some kind of
compatibility in the user tools), but if it is possible to use a separate
filesystem to hold all of the stats/parameters I'd much rather do that
than use debugfs (which has become root-access-only in newer kernels).

Cheers, Andreas

> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
> 
> Currently these pseudofiles look like:
> 
>  # find /proc/fs/nfsd/clients -type f|xargs tail
>  ==> /proc/fs/nfsd/clients/3741/opens <==
>  5cc0cd36/6debfb50/00000001/00000001	rw	--	fd:10:13649	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>  5cc0cd36/6debfb50/00000003/00000001	r-	--	fd:10:13650	'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 
>  ==> /proc/fs/nfsd/clients/3741/info <==
>  clientid: 6debfb505cc0cd36
>  address: 192.168.122.36:0
>  name: Linux NFSv4.2 test2.fieldses.org
>  minor version: 2
> 
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
> 
> So, some random questions:
> 
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
> 
> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful asci-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
> 
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
> 
> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
> 
> There's probably a lot more we could add to that info file eventually.
> 
> Other stuff to add next:
> 
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.
> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.
> 
> --b.
> 
> J. Bruce Fields (10):
>  nfsd: persist nfsd filesystem across mounts
>  nfsd: rename cl_refcount
>  nfsd4: use reference count to free client
>  nfsd: add nfsd/clients directory
>  nfsd: make client/ directory names small ints
>  rpc: replace rpc_filelist by tree_descr
>  nfsd4: add a client info file
>  nfsd4: add file to display list of client's opens
>  nfsd: expose some more information about NFSv4 opens
>  nfsd: add more information to client info file
> 
> fs/nfsd/netns.h                |   6 +
> fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
> fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
> fs/nfsd/nfsd.h                 |  11 ++
> fs/nfsd/state.h                |   9 +-
> fs/seq_file.c                  |  17 +++
> include/linux/seq_file.h       |   2 +
> include/linux/string_helpers.h |   1 +
> lib/string_helpers.c           |   5 +-
> net/sunrpc/rpc_pipe.c          |  37 ++----
> 10 files changed, 491 insertions(+), 50 deletions(-)
> 
> --
> 2.20.1
> 


Cheers, Andreas
NeilBrown April 25, 2019, 11:20 p.m. UTC | #5
On Thu, Apr 25 2019, Andreas Dilger wrote:

> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>> 
>> From: "J. Bruce Fields" <bfields@redhat.com>
>> 
>> The following patches expose information about NFSv4 opens held by knfsd
>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
>> 
>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>> "info" file with some basic information to help identify the client and
>> an "opens" directory that lists the opens held by that client.
>> 
>> I got it working by cobbling together some poorly-understood code I
>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>> and tell me what I've got wrong, they're more than welcome, but at this
>> stage I'm more curious for feedback on the interface.
>
> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
> My understanding is that "complex" files are verboten in procfs and sysfs?
> We've been going through a lengthy process to move files out of procfs
> into sysfs and debugfs as a result (while trying to maintain some kind of
> compatibility in the user tools), but if it is possible to use a separate
> filesystem to hold all of the stats/parameters I'd much rather do that
> than use debugfs (which has become root-access-only in newer kernels).

/proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
filesystem, originally created to replace the nfsd-specific
systemcall.
So the nfsd developers have a fair degree of latitude as to what can go
in there.

But I *don't* think it is a good idea to follow this pattern.  Creating
a separate control filesystem for every different module that thinks it
has different needs doesn't scale well.  We could end up with dozens of
tiny filesystems that all need to be mounted at just the right place.  I
don't think that is healthy for Linus.

Nor do I think we should be stuffing stuff into debugfs that isn't
really for debugging.  That isn't healthy either.

If sysfs doesn't meet our needs, then we need to raise that in
appropriate fora and present a clear case and try to build consensus -
because if we see a problem, then it is likely that others do to.

This is all presumably in the context of Lustre and while lustre is
out-of-tree we don't have a lot of leverage.  So I wouldn't consider
pursuing anything here until we get back upstream.

NeilBrown
Andreas Dilger April 26, 2019, 11 a.m. UTC | #6
> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Apr 25 2019, Andreas Dilger wrote:
> 
>> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>>> 
>>> From: "J. Bruce Fields" <bfields@redhat.com>
>>> 
>>> The following patches expose information about NFSv4 opens held by knfsd
>>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>>> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
>>> 
>>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>>> "info" file with some basic information to help identify the client and
>>> an "opens" directory that lists the opens held by that client.
>>> 
>>> I got it working by cobbling together some poorly-understood code I
>>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>>> and tell me what I've got wrong, they're more than welcome, but at this
>>> stage I'm more curious for feedback on the interface.
>> 
>> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
>> My understanding is that "complex" files are verboten in procfs and sysfs?
>> We've been going through a lengthy process to move files out of procfs
>> into sysfs and debugfs as a result (while trying to maintain some kind of
>> compatibility in the user tools), but if it is possible to use a separate
>> filesystem to hold all of the stats/parameters I'd much rather do that
>> than use debugfs (which has become root-access-only in newer kernels).
> 
> /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
> filesystem, originally created to replace the nfsd-specific systemcall.
> So the nfsd developers have a fair degree of latitude as to what can go
> in there.
> 
> But I *don't* think it is a good idea to follow this pattern.  Creating
> a separate control filesystem for every different module that thinks it
> has different needs doesn't scale well.  We could end up with dozens of
> tiny filesystems that all need to be mounted at just the right place.  I
> don't think that is healthy for Linus.
> 
> Nor do I think we should be stuffing stuff into debugfs that isn't
> really for debugging.  That isn't healthy either.
> 
> If sysfs doesn't meet our needs, then we need to raise that in
> appropriate fora and present a clear case and try to build consensus -
> because if we see a problem, then it is likely that others do to.

I definitely *do* see the restrictions sysfs as being a problem, and I'd
guess NFS developers thought the same, since the "one value per file"
paradigm means that any kind of complex data needs to be split over
hundreds or thousands of files, which is very inefficient for userspace to
use.  Consider if /proc/slabinfo had to follow the sysfs paradigm, this would
(on my system) need about 225 directories (one per slab) and 3589 separate
files in total (one per value) that would need to be read every second to
implement "slabtop".  Running strace on "top" shows it taking 0.25s wall time
to open and read the files for only 350 processes on my system, at 2 files
per process ("stat" and "statm"), and those have 44 and 7 values, respectively,
so if it had to follow the sysfs paradigm would make this far worse.

I think it would make a lot more sense to have one file per item of interest,
and make it e.g. a well-structured YAML format ("name: value", with indentation
denoting a hierarchy/grouping of related items) so that it can be both human
and machine readable, easily parsed by scripts using bash or awk, rather than
having an explicit directory+file hierarchy.  Files like /proc/meminfo and
/proc/<pid>/status are already YAML-formatted (or almost so), so it isn't ugly
like XML encoding.

> This is all presumably in the context of Lustre and while lustre is
> out-of-tree we don't have a lot of leverage.  So I wouldn't consider
> pursuing anything here until we get back upstream.

Sure, except that is a catch-22.  We can't discuss what is needed until
the code is in the kernel, but we can't get it into the kernel until the
files it puts in /proc have been moved into /sys?

Cheers, Andreas
J. Bruce Fields April 26, 2019, 12:56 p.m. UTC | #7
On Fri, Apr 26, 2019 at 01:00:19PM +0200, Andreas Dilger wrote:
> 
> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
> > /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
> > filesystem, originally created to replace the nfsd-specific systemcall.
> > So the nfsd developers have a fair degree of latitude as to what can go
> > in there.
> > 
> > But I *don't* think it is a good idea to follow this pattern.  Creating
> > a separate control filesystem for every different module that thinks it
> > has different needs doesn't scale well.  We could end up with dozens of
> > tiny filesystems that all need to be mounted at just the right place.

Aren't we already there?  My laptop, Fedora 29 with everything pretty much
default:

$  findmnt -n -oFSTYPE|sort|uniq -c
      1 autofs
      1 bpf
     11 cgroup
      1 cgroup2
      1 configfs
      1 debugfs
      1 devpts
      1 devtmpfs
      3 ext4
      1 fusectl
      1 fuse.gvfsd-fuse
      1 hugetlbfs
      1 mqueue
      1 proc
      1 pstore
      1 rpc_pipefs
      1 securityfs
      1 selinuxfs
      1 sysfs
      5 tmpfs

> > I don't think that is healthy for Linus.

What are the problems you see?

> > Nor do I think we should be stuffing stuff into debugfs that isn't
> > really for debugging.  That isn't healthy either.
> > 
> > If sysfs doesn't meet our needs, then we need to raise that in
> > appropriate fora and present a clear case and try to build consensus -
> > because if we see a problem, then it is likely that others do to.
> 
> I definitely *do* see the restrictions sysfs as being a problem, and I'd
> guess NFS developers thought the same,

For what it's worth, the nfsd filesystem predated sysfs, just barely.

Looking at the history....  It was actually Al that introduced it in March
2002.  Patrick Mochel added sysfs in October 2002.

But it's true that from the start nfsd didn't really fit the model of a single
(possibly writeable) attribute per file.

(Might be interesting to look at the distribution of new filesystem types over
time, there may have been a peak around then.)

--b.
NeilBrown April 26, 2019, 11:55 p.m. UTC | #8
On Fri, Apr 26 2019, J. Bruce Fields wrote:

> On Fri, Apr 26, 2019 at 01:00:19PM +0200, Andreas Dilger wrote:
>> 
>> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
>> > /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
>> > filesystem, originally created to replace the nfsd-specific systemcall.
>> > So the nfsd developers have a fair degree of latitude as to what can go
>> > in there.
>> > 
>> > But I *don't* think it is a good idea to follow this pattern.  Creating
>> > a separate control filesystem for every different module that thinks it
>> > has different needs doesn't scale well.  We could end up with dozens of
>> > tiny filesystems that all need to be mounted at just the right place.
>
> Aren't we already there?  My laptop, Fedora 29 with everything pretty much
> default:
>
> $  findmnt -n -oFSTYPE|sort|uniq -c
>       1 autofs
>       1 bpf
>      11 cgroup
>       1 cgroup2
>       1 configfs
>       1 debugfs
>       1 devpts
>       1 devtmpfs
>       3 ext4
>       1 fusectl
>       1 fuse.gvfsd-fuse
>       1 hugetlbfs
>       1 mqueue
>       1 proc
>       1 pstore
>       1 rpc_pipefs
>       1 securityfs
>       1 selinuxfs
>       1 sysfs
>       5 tmpfs

Sometimes I think "Linux is so full of crap, it hardly matters if more
crap is added".  Other times I feel a bit more social responsibility and
want to fight against adding too much more crap.

>
>> > I don't think that is healthy for Linus.
                                       ^^^^^ oops
>
> What are the problems you see?

Fragmentation.
This is exactly the platform problem.  People look at the existing
functionality, decide that it doesn't quite meet their needs, and then
do an implement something that is mostly the same but just different
enough so that they feel more comfortable.

We have 'seq_file' which encourages people to create line-oriented info
files, but if some have tab-separate lines, others have CSV, others have
yaml etc, then there is plenty of room for confusion.

If we could, instead, just agreed that more structured data actually can
make sense in sysfs, and come up with a coherent approach that we can
all tolerate, then life would be much easier.

>
>> > Nor do I think we should be stuffing stuff into debugfs that isn't
>> > really for debugging.  That isn't healthy either.
>> > 
>> > If sysfs doesn't meet our needs, then we need to raise that in
>> > appropriate fora and present a clear case and try to build consensus -
>> > because if we see a problem, then it is likely that others do to.
>> 
>> I definitely *do* see the restrictions sysfs as being a problem, and I'd
>> guess NFS developers thought the same,
>
> For what it's worth, the nfsd filesystem predated sysfs, just barely.
>
> Looking at the history....  It was actually Al that introduced it in March
> 2002.  Patrick Mochel added sysfs in October 2002.

IIRC the nfsd filesystem was added to address the difficulty of
supporting a systemcall in a module.  That basically doesn't work, so
there needed to be some clean interface between some minimal nfsdctl systemcall
code that was always compiled in, and the actually implementation that
could be in a module.  It was Viro who particularly noticed this need,
so it was a filesystem that was chosen to fill the need - filesystems in
modules was already a solved problem.
The system call was (I think) write-only (no non-trivial return data)
and each file just recieved a binary struct that matches the syscall
argument.
Once that was in place, it became a dumping ground for whatever we
wanted.

sysfs was, I think, originally about power management.  It exposed
connections between devices more than the devices themselves.  This
allowed user-space to be able to power-things down when not in use, and
to understand the dependencies.
Since then it grew....

>
> But it's true that from the start nfsd didn't really fit the model of a single
> (possibly writeable) attribute per file.

Depends on what you mean by that.  Original files where write-only and
where slightly complex attributes. Writing performed an action, like
adding an entry to the export table (first you add a client, then add a
client+filesystem to export it).

This idea for a file performing an action, rather than presenting an
attribute, is much the same as the "bind" and "unbind" files you can
find in sysfs.

(see also https://lwn.net/Articles/378884/ for examples of sysfs files
that are not one-attribute-per-file)

NeilBrown

>
> (Might be interesting to look at the distribution of new filesystem types over
> time, there may have been a peak around then.)
>
> --b.
NeilBrown April 27, 2019, 12:03 a.m. UTC | #9
On Fri, Apr 26 2019, Andreas Dilger wrote:

>> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Thu, Apr 25 2019, Andreas Dilger wrote:
>> 
>>> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>>>> 
>>>> From: "J. Bruce Fields" <bfields@redhat.com>
>>>> 
>>>> The following patches expose information about NFSv4 opens held by knfsd
>>>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>>>> unlike locks (/proc/locks) and local proccesses' opens (/proc/<pid>/).
>>>> 
>>>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>>>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>>>> "info" file with some basic information to help identify the client and
>>>> an "opens" directory that lists the opens held by that client.
>>>> 
>>>> I got it working by cobbling together some poorly-understood code I
>>>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>>>> and tell me what I've got wrong, they're more than welcome, but at this
>>>> stage I'm more curious for feedback on the interface.
>>> 
>>> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
>>> My understanding is that "complex" files are verboten in procfs and sysfs?
>>> We've been going through a lengthy process to move files out of procfs
>>> into sysfs and debugfs as a result (while trying to maintain some kind of
>>> compatibility in the user tools), but if it is possible to use a separate
>>> filesystem to hold all of the stats/parameters I'd much rather do that
>>> than use debugfs (which has become root-access-only in newer kernels).
>> 
>> /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
>> filesystem, originally created to replace the nfsd-specific systemcall.
>> So the nfsd developers have a fair degree of latitude as to what can go
>> in there.
>> 
>> But I *don't* think it is a good idea to follow this pattern.  Creating
>> a separate control filesystem for every different module that thinks it
>> has different needs doesn't scale well.  We could end up with dozens of
>> tiny filesystems that all need to be mounted at just the right place.  I
>> don't think that is healthy for Linus.
>> 
>> Nor do I think we should be stuffing stuff into debugfs that isn't
>> really for debugging.  That isn't healthy either.
>> 
>> If sysfs doesn't meet our needs, then we need to raise that in
>> appropriate fora and present a clear case and try to build consensus -
>> because if we see a problem, then it is likely that others do to.
>
> I definitely *do* see the restrictions sysfs as being a problem, and I'd
> guess NFS developers thought the same, since the "one value per file"
> paradigm means that any kind of complex data needs to be split over
> hundreds or thousands of files, which is very inefficient for userspace to
> use.  Consider if /proc/slabinfo had to follow the sysfs paradigm, this would
> (on my system) need about 225 directories (one per slab) and 3589 separate
> files in total (one per value) that would need to be read every second to
> implement "slabtop".  Running strace on "top" shows it taking 0.25s wall time
> to open and read the files for only 350 processes on my system, at 2 files
> per process ("stat" and "statm"), and those have 44 and 7 values, respectively,
> so if it had to follow the sysfs paradigm would make this far worse.
>
> I think it would make a lot more sense to have one file per item of interest,
> and make it e.g. a well-structured YAML format ("name: value", with indentation
> denoting a hierarchy/grouping of related items) so that it can be both human
> and machine readable, easily parsed by scripts using bash or awk, rather than
> having an explicit directory+file hierarchy.  Files like /proc/meminfo and
> /proc/<pid>/status are already YAML-formatted (or almost so), so it isn't ugly
> like XML encoding.

So what are your pain points?  What data do you really want to present in
a structured file?
Look at /proc/self/mountstats on some machine which has an NFS mount.
There would be no problem adding similar information for lustre mounts.

What data do you want to export to user-space, which wouldn't fit there
and doesn't fit the one-value-per-file model.  To make a case, we need
concrete data.

>
>> This is all presumably in the context of Lustre and while lustre is
>> out-of-tree we don't have a lot of leverage.  So I wouldn't consider
>> pursuing anything here until we get back upstream.
>
> Sure, except that is a catch-22.  We can't discuss what is needed until
> the code is in the kernel, but we can't get it into the kernel until the
> files it puts in /proc have been moved into /sys?

Or maybe just removed.  If lustre is usable without some of these files,
then we can land lustre without them, and then start the conversation
about how to export the data that we want exported.

NeilBrown
J. Bruce Fields April 27, 2019, 7 p.m. UTC | #10
On Sat, Apr 27, 2019 at 09:55:23AM +1000, NeilBrown wrote:
> On Fri, Apr 26 2019, J. Bruce Fields wrote:
> > But it's true that from the start nfsd didn't really fit the model
> > of a single (possibly writeable) attribute per file.
> 
> Depends on what you mean by that.  Original files where write-only and
> where slightly complex attributes.

Yes I thought it was just those too, but then I looked at the original
commit it also included at least the "exports" file.

> Writing performed an action, like
> adding an entry to the export table (first you add a client, then add a
> client+filesystem to export it).
> 
> This idea for a file performing an action, rather than presenting an
> attribute, is much the same as the "bind" and "unbind" files you can
> find in sysfs.
> 
> (see also https://lwn.net/Articles/378884/ for examples of sysfs files
> that are not one-attribute-per-file)

I'll give that a re-read, thanks.

I did spend maybe a few minutes looking into basing nfsd code on kernfs
and didn't think it was worth it.  I could take a more serious look.

--b.
NeilBrown April 28, 2019, 10:57 p.m. UTC | #11
On Sat, Apr 27 2019, J. Bruce Fields wrote:

> On Sat, Apr 27, 2019 at 09:55:23AM +1000, NeilBrown wrote:
>> On Fri, Apr 26 2019, J. Bruce Fields wrote:
>> > But it's true that from the start nfsd didn't really fit the model
>> > of a single (possibly writeable) attribute per file.
>> 
>> Depends on what you mean by that.  Original files where write-only and
>> where slightly complex attributes.
>
> Yes I thought it was just those too, but then I looked at the original
> commit it also included at least the "exports" file.

Maybe it depends on how one chooses to interpret history - never an
exact science.

The "exports" file pre-existed the nfsd filesystem - it was (and still
is) a file in procfs: /proc/fs/nfs/exports.  So the nfsd filesystem was
not created to provide that file.  It was created to replace a
systemcall.
As I said, it subsequently had a variety of things added to it.  exports
was just the first of these.  At least, that is how I choose to see it.

>
>> Writing performed an action, like
>> adding an entry to the export table (first you add a client, then add a
>> client+filesystem to export it).
>> 
>> This idea for a file performing an action, rather than presenting an
>> attribute, is much the same as the "bind" and "unbind" files you can
>> find in sysfs.
>> 
>> (see also https://lwn.net/Articles/378884/ for examples of sysfs files
>> that are not one-attribute-per-file)
>
> I'll give that a re-read, thanks.
>
> I did spend maybe a few minutes looking into basing nfsd code on kernfs
> and didn't think it was worth it.  I could take a more serious look.

I think that in your use-case it make lots of sense to have a structured
file for the "opens" (similar to /proc/locks and /proc/mounts).
The "info" could reasonably be several attribute files (clientid,
address, name, minor_version), but I don't think it benefits anyone for
'opens' to be a directory full of directories each with a file for each
of the fields.

So using kernfs would mean pushing for allowing structured files in
kernfs.  I'd be happy to support that, but I think you would need to go
into it convinced that you really wanted to persist.

I think using kernfs is mostly about embedding a kobject in everything,
then setting an appropriate ktype with appropriate attributes.  Not
particularly complex, but certainly a bit of work.

Thanks,
NeilBrown