Message ID: 1556201060-7947-1-git-send-email-bfields@redhat.com (mailing list archive)
Series: exposing knfsd opens to userspace
On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
>
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
>
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.
>
> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
>
> Currently these pseudofiles look like:
>
> # find /proc/fs/nfsd/clients -type f|xargs tail
> ==> /proc/fs/nfsd/clients/3741/opens <==
> 5cc0cd36/6debfb50/00000001/00000001 rw -- fd:10:13649 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 5cc0cd36/6debfb50/00000003/00000001 r- -- fd:10:13650 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>
> ==> /proc/fs/nfsd/clients/3741/info <==
> clientid: 6debfb505cc0cd36
> address: 192.168.122.36:0
> name: Linux NFSv4.2 test2.fieldses.org
> minor version: 2
>
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
>

Nice work! We've needed this for a long time.

One thing we need to consider here from the get-go though is what sort
of ABI guarantee you want for this format. People _will_ write scripts
that scrape this info, so we should take that into account up front.

> So, some random questions:
>
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
>

That does make it easy to correlate with the info in /proc/locks.

We'd have a dentry here by virtue of the nfs4_file. Should we print a
path in addition to this?

> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful ascii-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
>
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
>

It'd be ideal to be able to easily correlate this info with what
wireshark displays. Does wireshark display hashes for openowners? I know
it does for stateids. If so, generating the same hash would be really
nice.

That said, maybe it's best to just dump the raw info out here though and
rely on some postprocessing scripts for viewing it?

> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
>
> There's probably a lot more we could add to that info file eventually.
>
> Other stuff to add next:
>
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.

That would also be neat. We have a bit of code to support that today in
the fault injection code, but it'll need some cleanup and wiring it into
a knob here would be better.

> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.

Agreed.

> J. Bruce Fields (10):
>   nfsd: persist nfsd filesystem across mounts
>   nfsd: rename cl_refcount
>   nfsd4: use reference count to free client
>   nfsd: add nfsd/clients directory
>   nfsd: make client/ directory names small ints
>   rpc: replace rpc_filelist by tree_descr
>   nfsd4: add a client info file
>   nfsd4: add file to display list of client's opens
>   nfsd: expose some more information about NFSv4 opens
>   nfsd: add more information to client info file
>
>  fs/nfsd/netns.h                |   6 +
>  fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
>  fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
>  fs/nfsd/nfsd.h                 |  11 ++
>  fs/nfsd/state.h                |   9 +-
>  fs/seq_file.c                  |  17 +++
>  include/linux/seq_file.h       |   2 +
>  include/linux/string_helpers.h |   1 +
>  lib/string_helpers.c           |   5 +-
>  net/sunrpc/rpc_pipe.c          |  37 ++----
>  10 files changed, 491 insertions(+), 50 deletions(-)
>
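For scripts that do end up scraping the "opens" file, the tab-delimited layout described in the cover letter above (stateid, open access bits, deny bits, major:minor:ino, open owner) could be consumed along these lines. This is a sketch only: the field order is taken from the example output, not from any committed ABI, and the names are illustrative.

```python
# Hypothetical parser for one line of /proc/fs/nfsd/clients/<id>/opens,
# assuming the tab-delimited field order shown in the cover letter:
# stateid, access bits, deny bits, major:minor:ino, open owner.

def parse_open_line(line: str) -> dict:
    stateid, access, deny, dev_ino, owner = line.rstrip("\n").split("\t", 4)
    major, minor, ino = dev_ino.split(":")
    return {
        "stateid": stateid,          # four slash-separated hex words
        "access": access,            # e.g. "rw", "r-"
        "deny": deny,                # e.g. "--"
        "dev": (major, minor),       # device as major:minor
        "ino": int(ino),             # inode number
        "owner": owner,              # opaque, ascii-escaped blob
    }

line = ("5cc0cd36/6debfb50/00000001/00000001\trw\t--\t"
        "fd:10:13649\t'open id:...'")
rec = parse_open_line(line)
print(rec["access"], rec["ino"])
```

Splitting with `split("\t", 4)` keeps any tabs inside the escaped open-owner blob intact, which matters since the owner is arbitrary binary data.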
On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
>
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
>
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.
>
> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
>
> Currently these pseudofiles look like:
>
> # find /proc/fs/nfsd/clients -type f|xargs tail
> ==> /proc/fs/nfsd/clients/3741/opens <==
> 5cc0cd36/6debfb50/00000001/00000001 rw -- fd:10:13649 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 5cc0cd36/6debfb50/00000003/00000001 r- -- fd:10:13650 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>
> ==> /proc/fs/nfsd/clients/3741/info <==
> clientid: 6debfb505cc0cd36
> address: 192.168.122.36:0
> name: Linux NFSv4.2 test2.fieldses.org
> minor version: 2
>
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
>
> So, some random questions:
>
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
>
> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful ascii-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
>
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
>
> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
>
> There's probably a lot more we could add to that info file eventually.
>
> Other stuff to add next:
>
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.
> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.
>
> --b.
>
> J. Bruce Fields (10):
>   nfsd: persist nfsd filesystem across mounts
>   nfsd: rename cl_refcount
>   nfsd4: use reference count to free client
>   nfsd: add nfsd/clients directory
>   nfsd: make client/ directory names small ints
>   rpc: replace rpc_filelist by tree_descr
>   nfsd4: add a client info file
>   nfsd4: add file to display list of client's opens
>   nfsd: expose some more information about NFSv4 opens
>   nfsd: add more information to client info file
>
>  fs/nfsd/netns.h                |   6 +
>  fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
>  fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
>  fs/nfsd/nfsd.h                 |  11 ++
>  fs/nfsd/state.h                |   9 +-
>  fs/seq_file.c                  |  17 +++
>  include/linux/seq_file.h       |   2 +
>  include/linux/string_helpers.h |   1 +
>  lib/string_helpers.c           |   5 +-
>  net/sunrpc/rpc_pipe.c          |  37 ++----
>  10 files changed, 491 insertions(+), 50 deletions(-)
>

I went through the patches and the code all looks fine to me. The only
real questions I think are what sort of info we want to present here,
and how we'll deal with changes to the format in the future.

Maybe on that latter point, we can reserve the right to add fields to
this info later, but do our best to never remove existing ones? Then
tools could be written to just ignore the fields that they don't
understand.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
On Thu, Apr 25, 2019 at 01:02:30PM -0400, Jeff Layton wrote:
> On Thu, 2019-04-25 at 10:04 -0400, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <bfields@redhat.com>
> >
> > The following patches expose information about NFSv4 opens held by knfsd
> > on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> > unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
> >
> > The approach is to add a new directory /proc/fs/nfsd/clients/ with
> > subdirectories for each active NFSv4 client.  Each subdirectory has an
> > "info" file with some basic information to help identify the client and
> > an "opens" directory that lists the opens held by that client.
> >
> > I got it working by cobbling together some poorly-understood code I
> > found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> > and tell me what I've got wrong, they're more than welcome, but at this
> > stage I'm more curious for feedback on the interface.
> >
> > I'm also cc'ing people responsible for lsof and util-linux in case they
> > have any opinions.
> >
> > Currently these pseudofiles look like:
> >
> > # find /proc/fs/nfsd/clients -type f|xargs tail
> > ==> /proc/fs/nfsd/clients/3741/opens <==
> > 5cc0cd36/6debfb50/00000001/00000001 rw -- fd:10:13649 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> > 5cc0cd36/6debfb50/00000003/00000001 r- -- fd:10:13650 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> >
> > ==> /proc/fs/nfsd/clients/3741/info <==
> > clientid: 6debfb505cc0cd36
> > address: 192.168.122.36:0
> > name: Linux NFSv4.2 test2.fieldses.org
> > minor version: 2
> >
> > Each line of the "opens" file is tab-delimited and describes one open,
> > and the fields are stateid, open access bits, deny bits,
> > major:minor:ino, and open owner.
> >
>
> Nice work! We've needed this for a long time.
>
> One thing we need to consider here from the get-go though is what sort
> of ABI guarantee you want for this format. People _will_ write scripts
> that scrape this info, so we should take that into account up front.

There is a man page for the nfsd filesystem, nfsd(7).  I should write up
something to add to that.  If people write code without reading that
then we may still end up boxed in, of course, but it's a start.

What I'm hoping we can count on from readers:

	- they will ignore any unknown files in clients/#/.
	- readers will ignore any lines in clients/#/info starting with
	  an unrecognized keyword.
	- they will ignore any unknown data at the end of
	  clients/#/opens.

That's in approximate decreasing order of my confidence in those rules
being observed, though I don't think any of those are too much to ask.

> > So, some random questions:
> >
> > 	- I just copied the major:minor:ino thing from /proc/locks, I
> > 	  suspect we would have picked something different to identify
> > 	  inodes if /proc/locks were done now.  (Mount id and inode?
> > 	  Something else?)
> >
>
> That does make it easy to correlate with the info in /proc/locks.
>
> We'd have a dentry here by virtue of the nfs4_file. Should we print a
> path in addition to this?

We could.  It won't be 100% reliable, of course (unlinks, renames), but
it could still be convenient for human readers, and an optimization for
non-human readers trying to find an inode.

The filehandle might be a good idea too.

I wonder if there's any issue with line length, or with quantity of data
emitted by a single seq_file show method.  The open owner can be up to
4K (after escaping), paths and filehandles can be long too.

> > 	- The open owner is just an opaque blob of binary data, but
> > 	  clients may choose to include some useful ascii-encoded
> > 	  information, so I'm formatting them as strings with non-ascii
> > 	  stuff escaped.  For example, pynfs usually uses the name of
> > 	  the test as the open owner.  But as you see above, the ascii
> > 	  content of the Linux client's open owners is less useful.
> > 	  Also, there's no way I know of to map them back to a file
> > 	  description or process or anything else useful on the client,
> > 	  so perhaps they're of limited interest.
> >
> > 	- I'm not sure about the stateid either.  I did think it might
> > 	  be useful just as a unique identifier for each line.
> > 	  (Actually for that it'd be enough to take just the third of
> > 	  those four numbers making up the stateid--maybe that would be
> > 	  better.)
>
> It'd be ideal to be able to easily correlate this info with what
> wireshark displays. Does wireshark display hashes for openowners? I know
> it does for stateids. If so, generating the same hash would be really
> nice.
>
> That said, maybe it's best to just dump the raw info out here though and
> rely on some postprocessing scripts for viewing it?

In that case, I think so, as I don't know how committed wireshark is to
the choice of hash.

> > In the "info" file, the "name" line is the client identifier/client
> > owner provided by the client, which (like the stateowner) is just opaque
> > binary data, though as you can see here the Linux client is providing a
> > readable ascii string.
> >
> > There's probably a lot more we could add to that info file eventually.
> >
> > Other stuff to add next:
> >
> > 	- nfsd/clients/#/kill that you can write to to revoke all a
> > 	  client's state if it's wedged somehow.
>
> That would also be neat. We have a bit of code to support that today in
> the fault injection code, but it'll need some cleanup and wiring it into
> a knob here would be better.

OK, good, I'm working on that.

Looks like fault injection gives up if there are rpc's in process for
the given client, whereas here I'd rather force the expiry.  Looks like
that needs some straightforward waitqueue logic to wait for the
in-progress rpc's.

--b.
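The reader rules proposed above amount to: treat clients/#/info as "keyword: value" lines and silently skip anything unrecognized, so new fields can be added later without breaking old tools. A minimal sketch of such a forward-compatible reader follows; the keyword set is taken from the example info file, and everything else is hypothetical.

```python
# Hypothetical forward-compatible reader for clients/<id>/info,
# following the rule "ignore lines starting with an unrecognized
# keyword".  Keywords are those shown in the example output.

KNOWN_KEYS = {"clientid", "address", "name", "minor version"}

def parse_info(text: str) -> dict:
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in KNOWN_KEYS:
            # partition() splits on the first colon only, so values
            # like "192.168.122.36:0" survive intact.
            info[key.strip()] = value.strip()
        # unknown keywords fall through and are ignored
    return info

sample = (
    "clientid: 6debfb505cc0cd36\n"
    "address: 192.168.122.36:0\n"
    "name: Linux NFSv4.2 test2.fieldses.org\n"
    "minor version: 2\n"
    "some future field: whatever\n"   # hypothetical future addition
)
print(parse_info(sample)["minor version"])
```

The same "skip what you don't know" shape extends naturally to unknown files in clients/#/ and trailing fields in the opens file.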
On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>
> From: "J. Bruce Fields" <bfields@redhat.com>
>
> The following patches expose information about NFSv4 opens held by knfsd
> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>
> The approach is to add a new directory /proc/fs/nfsd/clients/ with
> subdirectories for each active NFSv4 client.  Each subdirectory has an
> "info" file with some basic information to help identify the client and
> an "opens" directory that lists the opens held by that client.
>
> I got it working by cobbling together some poorly-understood code I
> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
> and tell me what I've got wrong, they're more than welcome, but at this
> stage I'm more curious for feedback on the interface.

Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
My understanding is that "complex" files are verboten in procfs and sysfs?
We've been going through a lengthy process to move files out of procfs
into sysfs and debugfs as a result (while trying to maintain some kind of
compatibility in the user tools), but if it is possible to use a separate
filesystem to hold all of the stats/parameters I'd much rather do that
than use debugfs (which has become root-access-only in newer kernels).

Cheers, Andreas

> I'm also cc'ing people responsible for lsof and util-linux in case they
> have any opinions.
>
> Currently these pseudofiles look like:
>
> # find /proc/fs/nfsd/clients -type f|xargs tail
> ==> /proc/fs/nfsd/clients/3741/opens <==
> 5cc0cd36/6debfb50/00000001/00000001 rw -- fd:10:13649 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
> 5cc0cd36/6debfb50/00000003/00000001 r- -- fd:10:13650 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
>
> ==> /proc/fs/nfsd/clients/3741/info <==
> clientid: 6debfb505cc0cd36
> address: 192.168.122.36:0
> name: Linux NFSv4.2 test2.fieldses.org
> minor version: 2
>
> Each line of the "opens" file is tab-delimited and describes one open,
> and the fields are stateid, open access bits, deny bits,
> major:minor:ino, and open owner.
>
> So, some random questions:
>
> 	- I just copied the major:minor:ino thing from /proc/locks, I
> 	  suspect we would have picked something different to identify
> 	  inodes if /proc/locks were done now.  (Mount id and inode?
> 	  Something else?)
>
> 	- The open owner is just an opaque blob of binary data, but
> 	  clients may choose to include some useful ascii-encoded
> 	  information, so I'm formatting them as strings with non-ascii
> 	  stuff escaped.  For example, pynfs usually uses the name of
> 	  the test as the open owner.  But as you see above, the ascii
> 	  content of the Linux client's open owners is less useful.
> 	  Also, there's no way I know of to map them back to a file
> 	  description or process or anything else useful on the client,
> 	  so perhaps they're of limited interest.
>
> 	- I'm not sure about the stateid either.  I did think it might
> 	  be useful just as a unique identifier for each line.
> 	  (Actually for that it'd be enough to take just the third of
> 	  those four numbers making up the stateid--maybe that would be
> 	  better.)
>
> In the "info" file, the "name" line is the client identifier/client
> owner provided by the client, which (like the stateowner) is just opaque
> binary data, though as you can see here the Linux client is providing a
> readable ascii string.
>
> There's probably a lot more we could add to that info file eventually.
>
> Other stuff to add next:
>
> 	- nfsd/clients/#/kill that you can write to to revoke all a
> 	  client's state if it's wedged somehow.
> 	- lists of locks and delegations; lower priority since most of
> 	  that information is already in /proc/locks.
>
> --b.
>
> J. Bruce Fields (10):
>   nfsd: persist nfsd filesystem across mounts
>   nfsd: rename cl_refcount
>   nfsd4: use reference count to free client
>   nfsd: add nfsd/clients directory
>   nfsd: make client/ directory names small ints
>   rpc: replace rpc_filelist by tree_descr
>   nfsd4: add a client info file
>   nfsd4: add file to display list of client's opens
>   nfsd: expose some more information about NFSv4 opens
>   nfsd: add more information to client info file
>
>  fs/nfsd/netns.h                |   6 +
>  fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
>  fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
>  fs/nfsd/nfsd.h                 |  11 ++
>  fs/nfsd/state.h                |   9 +-
>  fs/seq_file.c                  |  17 +++
>  include/linux/seq_file.h       |   2 +
>  include/linux/string_helpers.h |   1 +
>  lib/string_helpers.c           |   5 +-
>  net/sunrpc/rpc_pipe.c          |  37 ++----
>  10 files changed, 491 insertions(+), 50 deletions(-)
>
> --
> 2.20.1
>

Cheers, Andreas
On Thu, Apr 25 2019, Andreas Dilger wrote:

> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>>
>> From: "J. Bruce Fields" <bfields@redhat.com>
>>
>> The following patches expose information about NFSv4 opens held by knfsd
>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>>
>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>> "info" file with some basic information to help identify the client and
>> an "opens" directory that lists the opens held by that client.
>>
>> I got it working by cobbling together some poorly-understood code I
>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>> and tell me what I've got wrong, they're more than welcome, but at this
>> stage I'm more curious for feedback on the interface.
>
> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
> My understanding is that "complex" files are verboten in procfs and sysfs?
> We've been going through a lengthy process to move files out of procfs
> into sysfs and debugfs as a result (while trying to maintain some kind of
> compatibility in the user tools), but if it is possible to use a separate
> filesystem to hold all of the stats/parameters I'd much rather do that
> than use debugfs (which has become root-access-only in newer kernels).

/proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
filesystem, originally created to replace the nfsd-specific systemcall.
So the nfsd developers have a fair degree of latitude as to what can go
in there.

But I *don't* think it is a good idea to follow this pattern.  Creating
a separate control filesystem for every different module that thinks it
has different needs doesn't scale well.  We could end up with dozens of
tiny filesystems that all need to be mounted at just the right place.  I
don't think that is healthy for Linus.

Nor do I think we should be stuffing stuff into debugfs that isn't
really for debugging.  That isn't healthy either.

If sysfs doesn't meet our needs, then we need to raise that in
appropriate fora and present a clear case and try to build consensus -
because if we see a problem, then it is likely that others do too.

This is all presumably in the context of Lustre and while lustre is
out-of-tree we don't have a lot of leverage.  So I wouldn't consider
pursuing anything here until we get back upstream.

NeilBrown
> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
>
> On Thu, Apr 25 2019, Andreas Dilger wrote:
>
>> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>>>
>>> From: "J. Bruce Fields" <bfields@redhat.com>
>>>
>>> The following patches expose information about NFSv4 opens held by knfsd
>>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>>> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>>>
>>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>>> "info" file with some basic information to help identify the client and
>>> an "opens" directory that lists the opens held by that client.
>>>
>>> I got it working by cobbling together some poorly-understood code I
>>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>>> and tell me what I've got wrong, they're more than welcome, but at this
>>> stage I'm more curious for feedback on the interface.
>>
>> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
>> My understanding is that "complex" files are verboten in procfs and sysfs?
>> We've been going through a lengthy process to move files out of procfs
>> into sysfs and debugfs as a result (while trying to maintain some kind of
>> compatibility in the user tools), but if it is possible to use a separate
>> filesystem to hold all of the stats/parameters I'd much rather do that
>> than use debugfs (which has become root-access-only in newer kernels).
>
> /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
> filesystem, originally created to replace the nfsd-specific systemcall.
> So the nfsd developers have a fair degree of latitude as to what can go
> in there.
>
> But I *don't* think it is a good idea to follow this pattern.  Creating
> a separate control filesystem for every different module that thinks it
> has different needs doesn't scale well.  We could end up with dozens of
> tiny filesystems that all need to be mounted at just the right place.  I
> don't think that is healthy for Linus.
>
> Nor do I think we should be stuffing stuff into debugfs that isn't
> really for debugging.  That isn't healthy either.
>
> If sysfs doesn't meet our needs, then we need to raise that in
> appropriate fora and present a clear case and try to build consensus -
> because if we see a problem, then it is likely that others do too.

I definitely *do* see the restrictions of sysfs as being a problem, and I'd
guess NFS developers thought the same, since the "one value per file"
paradigm means that any kind of complex data needs to be split over
hundreds or thousands of files, which is very inefficient for userspace
to use.

Consider if /proc/slabinfo had to follow the sysfs paradigm, this would
(on my system) need about 225 directories (one per slab) and 3589
separate files in total (one per value) that would need to be read every
second to implement "slabtop".  Running strace on "top" shows it taking
0.25s wall time to open and read the files for only 350 processes on my
system, at 2 files per process ("stat" and "statm"), and those have 44
and 7 values, respectively, so if it had to follow the sysfs paradigm
this would be far worse.

I think it would make a lot more sense to have one file per item of
interest, and make it e.g. a well-structured YAML format ("name: value",
with indentation denoting a hierarchy/grouping of related items) so that
it can be both human and machine readable, easily parsed by scripts
using bash or awk, rather than having an explicit directory+file
hierarchy.  Files like /proc/meminfo and /proc/<pid>/status are already
YAML-formatted (or almost so), so it isn't ugly like XML encoding.

> This is all presumably in the context of Lustre and while lustre is
> out-of-tree we don't have a lot of leverage.  So I wouldn't consider
> pursuing anything here until we get back upstream.

Sure, except that is a catch-22.  We can't discuss what is needed until
the code is in the kernel, but we can't get it into the kernel until the
files it puts in /proc have been moved into /sys?

Cheers, Andreas
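The "name: value" style Andreas describes is indeed trivial to consume from scripts. A sketch of the kind of one-pass parse he has in mind, using /proc/meminfo-style input (the kB-unit handling is an assumption based on meminfo's usual layout):

```python
# Sketch: parse "name: value [unit]" lines as found in /proc/meminfo.
# This illustrates Andreas's point that such files need only one read
# and a trivial split, versus one file per value under sysfs rules.

def parse_meminfo(text: str) -> dict:
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        n = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            n *= 1024          # normalize to bytes
        values[key] = n
    return values

sample = ("MemTotal:       16316412 kB\n"
          "MemFree:         8295992 kB\n"
          "HugePages_Total:       0\n")
print(parse_meminfo(sample)["HugePages_Total"])
```

A sysfs-style layout would require one open()/read()/close() cycle per value for the same information, which is the overhead the slabtop/top numbers above are measuring.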
On Fri, Apr 26, 2019 at 01:00:19PM +0200, Andreas Dilger wrote:
> > On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
> > /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
> > filesystem, originally created to replace the nfsd-specific systemcall.
> > So the nfsd developers have a fair degree of latitude as to what can go
> > in there.
> >
> > But I *don't* think it is a good idea to follow this pattern.  Creating
> > a separate control filesystem for every different module that thinks it
> > has different needs doesn't scale well.  We could end up with dozens of
> > tiny filesystems that all need to be mounted at just the right place.

Aren't we already there?  My laptop, Fedora 29 with everything pretty
much default:

	$ findmnt -n -oFSTYPE|sort|uniq -c
	      1 autofs
	      1 bpf
	     11 cgroup
	      1 cgroup2
	      1 configfs
	      1 debugfs
	      1 devpts
	      1 devtmpfs
	      3 ext4
	      1 fusectl
	      1 fuse.gvfsd-fuse
	      1 hugetlbfs
	      1 mqueue
	      1 proc
	      1 pstore
	      1 rpc_pipefs
	      1 securityfs
	      1 selinuxfs
	      1 sysfs
	      5 tmpfs

> > I don't think that is healthy for Linus.

What are the problems you see?

> > Nor do I think we should be stuffing stuff into debugfs that isn't
> > really for debugging.  That isn't healthy either.
> >
> > If sysfs doesn't meet our needs, then we need to raise that in
> > appropriate fora and present a clear case and try to build consensus -
> > because if we see a problem, then it is likely that others do too.
>
> I definitely *do* see the restrictions of sysfs as being a problem, and I'd
> guess NFS developers thought the same,

For what it's worth, the nfsd filesystem predated sysfs, just barely.

Looking at the history....  It was actually Al that introduced it in
March 2002.  Patrick Mochel added sysfs in October 2002.

But it's true that from the start nfsd didn't really fit the model of a
single (possibly writeable) attribute per file.

(Might be interesting to look at the distribution of new filesystem
types over time, there may have been a peak around then.)

--b.
On Fri, Apr 26 2019, J. Bruce Fields wrote: > On Fri, Apr 26, 2019 at 01:00:19PM +0200, Andreas Dilger wrote: >> >> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote: >> > /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific >> > filesystem, originally created to replace the nfsd-specific systemcall. >> > So the nfsd developers have a fair degree of latitude as to what can go >> > in there. >> > >> > But I *don't* think it is a good idea to follow this pattern. Creating >> > a separate control filesystem for every different module that thinks it >> > has different needs doesn't scale well. We could end up with dozens of >> > tiny filesystems that all need to be mounted at just the right place. > > Aren't we already there? My laptop, Fedora 29 with everything pretty much > default: > > $ findmnt -n -oFSTYPE|sort|uniq -c > 1 autofs > 1 bpf > 11 cgroup > 1 cgroup2 > 1 configfs > 1 debugfs > 1 devpts > 1 devtmpfs > 3 ext4 > 1 fusectl > 1 fuse.gvfsd-fuse > 1 hugetlbfs > 1 mqueue > 1 proc > 1 pstore > 1 rpc_pipefs > 1 securityfs > 1 selinuxfs > 1 sysfs > 5 tmpfs Sometimes I think "Linux is so full of crap, it hardly matters if more crap is added". Other times I feel a bit more social responsibility and want to fight against adding too much more crap. > >> > I don't think that is healthy for Linus. ^^^^^ oops > > What are the problems you see? Fragmentation. This is exactly the platform problem. People look at the existing functionality, decide that it doesn't quite meet their needs, and then do an implement something that is mostly the same but just different enough so that they feel more comfortable. We have 'seq_file' which encourages people to create line-oriented info files, but if some have tab-separate lines, others have CSV, others have yaml etc, then there is plenty of room for confusion. 
If we could, instead, just agreed that more structured data actually can make sense in sysfs, and come up with a coherent approach that we can all tolerate, then life would be much easier. > >> > Nor do I think we should be stuffing stuff into debugfs that isn't >> > really for debugging. That isn't healthy either. >> > >> > If sysfs doesn't meet our needs, then we need to raise that in >> > appropriate fora and present a clear case and try to build consensus - >> > because if we see a problem, then it is likely that others do to. >> >> I definitely *do* see the restrictions sysfs as being a problem, and I'd >> guess NFS developers thought the same, > > For what it's worth, the nfsd filesystem predated sysfs, just barely. > > Looking at the history.... It was actually Al that introduced it in March > 2002. Patrick Mochel added sysfs in October 2002. IIRC the nfsd filesystem was added to address the difficulty of supporting a systemcall in a module. That basically doesn't work, so there needed to be some clean interface between some minimal nfsdctl systemcall code that was always compiled in, and the actually implementation that could be in a module. It was Viro who particularly noticed this need, so it was a filesystem that was chosen to fill the need - filesystems in modules was already a solved problem. The system call was (I think) write-only (no non-trivial return data) and each file just recieved a binary struct that matches the syscall argument. Once that was in place, it became a dumping ground for whatever we wanted. sysfs was, I think, originally about power management. It exposed connections between devices more than the devices themselves. This allowed user-space to be able to power-things down when not in use, and to understand the dependencies. Since then it grew.... > > But it's true that from the start nfsd didn't really fit the model of a single > (possibly writeable) attribute per file. Depends on what you mean by that. 
Original files were write-only and were slightly complex attributes.
Writing performed an action, like adding an entry to the export table
(first you add a client, then add a client+filesystem to export it).

This idea of a file performing an action, rather than presenting an
attribute, is much the same as the "bind" and "unbind" files you can
find in sysfs.

(see also https://lwn.net/Articles/378884/ for examples of sysfs files
that are not one-attribute-per-file)

NeilBrown

>
> (Might be interesting to look at the distribution of new filesystem types over
> time, there may have been a peak around then.)
>
> --b.
On Fri, Apr 26 2019, Andreas Dilger wrote:
>> On Apr 26, 2019, at 1:20 AM, NeilBrown <neilb@suse.com> wrote:
>>
>> On Thu, Apr 25 2019, Andreas Dilger wrote:
>>
>>> On Apr 25, 2019, at 4:04 PM, J. Bruce Fields <bfields@redhat.com> wrote:
>>>>
>>>> From: "J. Bruce Fields" <bfields@redhat.com>
>>>>
>>>> The following patches expose information about NFSv4 opens held by knfsd
>>>> on behalf of NFSv4 clients.  Those are currently invisible to userspace,
>>>> unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).
>>>>
>>>> The approach is to add a new directory /proc/fs/nfsd/clients/ with
>>>> subdirectories for each active NFSv4 client.  Each subdirectory has an
>>>> "info" file with some basic information to help identify the client and
>>>> an "opens" directory that lists the opens held by that client.
>>>>
>>>> I got it working by cobbling together some poorly-understood code I
>>>> found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
>>>> and tell me what I've got wrong, they're more than welcome, but at this
>>>> stage I'm more curious for feedback on the interface.
>>>
>>> Is this in procfs, sysfs, or a separate NFSD-specific filesystem?
>>> My understanding is that "complex" files are verboten in procfs and sysfs?
>>> We've been going through a lengthy process to move files out of procfs
>>> into sysfs and debugfs as a result (while trying to maintain some kind of
>>> compatibility in the user tools), but if it is possible to use a separate
>>> filesystem to hold all of the stats/parameters I'd much rather do that
>>> than use debugfs (which has become root-access-only in newer kernels).
>>
>> /proc/fs/nfsd is the (standard) mount point for a separate NFSD-specific
>> filesystem, originally created to replace the nfsd-specific systemcall.
>> So the nfsd developers have a fair degree of latitude as to what can go
>> in there.
>>
>> But I *don't* think it is a good idea to follow this pattern.
>> Creating
>> a separate control filesystem for every different module that thinks it
>> has different needs doesn't scale well.  We could end up with dozens of
>> tiny filesystems that all need to be mounted at just the right place.  I
>> don't think that is healthy for Linus.
>>
>> Nor do I think we should be stuffing stuff into debugfs that isn't
>> really for debugging.  That isn't healthy either.
>>
>> If sysfs doesn't meet our needs, then we need to raise that in
>> appropriate fora and present a clear case and try to build consensus -
>> because if we see a problem, then it is likely that others do too.
>
> I definitely *do* see the restrictions of sysfs as being a problem, and I'd
> guess NFS developers thought the same, since the "one value per file"
> paradigm means that any kind of complex data needs to be split over
> hundreds or thousands of files, which is very inefficient for userspace to
> use.  Consider if /proc/slabinfo had to follow the sysfs paradigm: this would
> (on my system) need about 225 directories (one per slab) and 3589 separate
> files in total (one per value) that would need to be read every second to
> implement "slabtop".  Running strace on "top" shows it taking 0.25s wall time
> to open and read the files for only 350 processes on my system, at 2 files
> per process ("stat" and "statm"), and those have 44 and 7 values,
> respectively, so having to follow the sysfs paradigm would make this far
> worse.
>
> I think it would make a lot more sense to have one file per item of interest,
> and make it e.g. a well-structured YAML format ("name: value", with indentation
> denoting a hierarchy/grouping of related items) so that it can be both human
> and machine readable, easily parsed by scripts using bash or awk, rather than
> having an explicit directory+file hierarchy.  Files like /proc/meminfo and
> /proc/<pid>/status are already YAML-formatted (or almost so), so it isn't ugly
> like XML encoding.

So what are your pain points?
What data do you really want to present in a structured file?

Look at /proc/self/mountstats on some machine which has an NFS mount.
There would be no problem adding similar information for lustre mounts.
What data do you want to export to user-space which wouldn't fit there
and doesn't fit the one-value-per-file model?

To make a case, we need concrete data.

>
>> This is all presumably in the context of Lustre, and while lustre is
>> out-of-tree we don't have a lot of leverage.  So I wouldn't consider
>> pursuing anything here until we get back upstream.
>
> Sure, except that is a catch-22.  We can't discuss what is needed until
> the code is in the kernel, but we can't get it into the kernel until the
> files it puts in /proc have been moved into /sys?

Or maybe just removed.  If lustre is usable without some of these files,
then we can land lustre without them, and then start the conversation
about how to export the data that we want exported.

NeilBrown
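As an aside on the "name: value" style Andreas mentions above (the style
of /proc/meminfo and /proc/<pid>/status): part of its appeal is that one
read and a few lines of script recover the whole structure. A minimal
sketch, assuming a flat file with no indented hierarchy; the sample
input is invented (real meminfo layout) and the helper name is my own:

```python
def parse_name_value(text):
    """Parse a flat "name: value" file (the /proc/meminfo style) into a dict.

    Values are kept as strings since units vary by field; lines without
    a colon are skipped.
    """
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            result[name.strip()] = value.strip()
    return result

# Invented sample following the real /proc/meminfo layout:
sample = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         1234567 kB\n"
    "Dirty:               328 kB\n"
)
meminfo = parse_name_value(sample)
```

A consumer like slabtop would need only one such read per update, which
is the efficiency argument being made above.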
On Sat, Apr 27, 2019 at 09:55:23AM +1000, NeilBrown wrote:
> On Fri, Apr 26 2019, J. Bruce Fields wrote:
> > But it's true that from the start nfsd didn't really fit the model
> > of a single (possibly writeable) attribute per file.
>
> Depends on what you mean by that.  Original files were write-only and
> were slightly complex attributes.

Yes, I thought it was just those too, but then I looked at the original
commit: it also included at least the "exports" file.

> Writing performed an action, like
> adding an entry to the export table (first you add a client, then add a
> client+filesystem to export it).
>
> This idea of a file performing an action, rather than presenting an
> attribute, is much the same as the "bind" and "unbind" files you can
> find in sysfs.
>
> (see also https://lwn.net/Articles/378884/ for examples of sysfs files
> that are not one-attribute-per-file)

I'll give that a re-read, thanks.

I did spend maybe a few minutes looking into basing nfsd code on kernfs
and didn't think it was worth it.  I could take a more serious look.

--b.
On Sat, Apr 27 2019, J. Bruce Fields wrote:
> On Sat, Apr 27, 2019 at 09:55:23AM +1000, NeilBrown wrote:
>> On Fri, Apr 26 2019, J. Bruce Fields wrote:
>> > But it's true that from the start nfsd didn't really fit the model
>> > of a single (possibly writeable) attribute per file.
>>
>> Depends on what you mean by that.  Original files were write-only and
>> were slightly complex attributes.
>
> Yes, I thought it was just those too, but then I looked at the original
> commit: it also included at least the "exports" file.

Maybe it depends on how one chooses to interpret history - never an
exact science.

The "exports" file pre-existed the nfsd filesystem - it was (and still
is) a file in procfs: /proc/fs/nfs/exports.  So the nfsd filesystem was
not created to provide that file.  It was created to replace a
systemcall.  As I said, it subsequently had a variety of things added to
it; exports was just the first of these.

At least, that is how I choose to see it.

>
>> Writing performed an action, like
>> adding an entry to the export table (first you add a client, then add a
>> client+filesystem to export it).
>>
>> This idea of a file performing an action, rather than presenting an
>> attribute, is much the same as the "bind" and "unbind" files you can
>> find in sysfs.
>>
>> (see also https://lwn.net/Articles/378884/ for examples of sysfs files
>> that are not one-attribute-per-file)
>
> I'll give that a re-read, thanks.
>
> I did spend maybe a few minutes looking into basing nfsd code on kernfs
> and didn't think it was worth it.  I could take a more serious look.

I think that in your use-case it makes lots of sense to have a
structured file for the "opens" (similar to /proc/locks and
/proc/mounts).  The "info" could reasonably be several attribute files
(clientid, address, name, minor_version), but I don't think it benefits
anyone for 'opens' to be a directory full of directories, each with a
file for each of the fields.
So using kernfs would mean pushing for allowing structured files in
kernfs.  I'd be happy to support that, but I think you would need to go
into it convinced that you really wanted to persist.

I think using kernfs is mostly about embedding a kobject in everything,
then setting an appropriate ktype with appropriate attributes.  Not
particularly complex, but certainly a bit of work.

Thanks,
NeilBrown
From: "J. Bruce Fields" <bfields@redhat.com>

The following patches expose information about NFSv4 opens held by knfsd
on behalf of NFSv4 clients.  Those are currently invisible to userspace,
unlike locks (/proc/locks) and local processes' opens (/proc/<pid>/).

The approach is to add a new directory /proc/fs/nfsd/clients/ with
subdirectories for each active NFSv4 client.  Each subdirectory has an
"info" file with some basic information to help identify the client and
an "opens" directory that lists the opens held by that client.

I got it working by cobbling together some poorly-understood code I
found in libfs, rpc_pipefs and elsewhere.  If anyone wants to wade in
and tell me what I've got wrong, they're more than welcome, but at this
stage I'm more curious for feedback on the interface.

I'm also cc'ing people responsible for lsof and util-linux in case they
have any opinions.

Currently these pseudofiles look like:

# find /proc/fs/nfsd/clients -type f|xargs tail
==> /proc/fs/nfsd/clients/3741/opens <==
5cc0cd36/6debfb50/00000001/00000001 rw -- fd:10:13649 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'
5cc0cd36/6debfb50/00000003/00000001 r- -- fd:10:13650 'open id:\x00\x00\x00&\x00\x00\x00\x00\x00\x00\x0b\xb7\x89%\xfc\xef'

==> /proc/fs/nfsd/clients/3741/info <==
clientid: 6debfb505cc0cd36
address: 192.168.122.36:0
name: Linux NFSv4.2 test2.fieldses.org
minor version: 2

Each line of the "opens" file is tab-delimited and describes one open,
and the fields are stateid, open access bits, deny bits,
major:minor:ino, and open owner.

So, some random questions:

- I just copied the major:minor:ino thing from /proc/locks, I suspect we
  would have picked something different to identify inodes if /proc/locks
  were done now.  (Mount id and inode?  Something else?)

- The open owner is just an opaque blob of binary data, but clients may
  choose to include some useful ascii-encoded information, so I'm
  formatting them as strings with non-ascii stuff escaped.
For example, pynfs usually uses the name of the test as the open owner.
But as you see above, the ascii content of the Linux client's open
owners is less useful.  Also, there's no way I know of to map them back
to a file description or process or anything else useful on the client,
so perhaps they're of limited interest.

- I'm not sure about the stateid either.  I did think it might be useful
  just as a unique identifier for each line.  (Actually for that it'd be
  enough to take just the third of those four numbers making up the
  stateid--maybe that would be better.)

In the "info" file, the "name" line is the client identifier/client
owner provided by the client, which (like the stateowner) is just opaque
binary data, though as you can see here the Linux client is providing a
readable ascii string.

There's probably a lot more we could add to that info file eventually.

Other stuff to add next:

- nfsd/clients/#/kill that you can write to to revoke all a client's
  state if it's wedged somehow.

- lists of locks and delegations; lower priority since most of that
  information is already in /proc/locks.

--b.

J. Bruce Fields (10):
  nfsd: persist nfsd filesystem across mounts
  nfsd: rename cl_refcount
  nfsd4: use reference count to free client
  nfsd: add nfsd/clients directory
  nfsd: make client/ directory names small ints
  rpc: replace rpc_filelist by tree_descr
  nfsd4: add a client info file
  nfsd4: add file to display list of client's opens
  nfsd: expose some more information about NFSv4 opens
  nfsd: add more information to client info file

 fs/nfsd/netns.h                |   6 +
 fs/nfsd/nfs4state.c            | 228 ++++++++++++++++++++++++++++++---
 fs/nfsd/nfsctl.c               | 225 +++++++++++++++++++++++++++++++-
 fs/nfsd/nfsd.h                 |  11 ++
 fs/nfsd/state.h                |   9 +-
 fs/seq_file.c                  |  17 +++
 include/linux/seq_file.h       |   2 +
 include/linux/string_helpers.h |   1 +
 lib/string_helpers.c           |   5 +-
 net/sunrpc/rpc_pipe.c          |  37 ++----
 10 files changed, 491 insertions(+), 50 deletions(-)
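Since people will write scripts that scrape this, here is a rough sketch
of how a tool like lsof might consume one line of the proposed "opens"
format.  It assumes the fields are tab-delimited in the order given in
the cover letter (stateid, open access bits, deny bits, major:minor:ino,
open owner) and that the owner is a single-quoted string with non-ascii
bytes escaped as \xNN; the function and key names are my own, not part
of the proposed interface:

```python
def parse_open_line(line):
    """Parse one tab-delimited line of /proc/fs/nfsd/clients/<id>/opens.

    Field order follows the cover letter: stateid, access, deny,
    major:minor:ino, open owner.
    """
    stateid, access, deny, inode, owner = line.rstrip("\n").split("\t")
    major, minor, ino = inode.split(":")
    # The owner is a single-quoted string with non-ascii bytes escaped
    # as \xNN; undo the escaping to recover the raw opaque bytes.
    owner_bytes = (owner[1:-1].encode("ascii")
                   .decode("unicode_escape").encode("latin-1"))
    return {
        "stateid": tuple(stateid.split("/")),
        "access": access,
        "deny": deny,
        "dev": (major, minor),
        "ino": int(ino),
        "owner": owner_bytes,
    }

# The first sample line from the cover letter, with tabs as delimiters:
sample = ("5cc0cd36/6debfb50/00000001/00000001\trw\t--\tfd:10:13649\t"
          "'open id:\\x00\\x00\\x00&\\x00\\x00\\x00\\x00"
          "\\x00\\x00\\x0b\\xb7\\x89%\\xfc\\xef'")
open_info = parse_open_line(sample)
```

If scripts end up doing exactly this kind of parsing, that reinforces
the point made earlier in the thread about settling the ABI guarantees
for the format up front.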