diff mbox series

[v6,17/18] nfs: add Documentation/filesystems/nfs/localio.rst

Message ID 20240619204032.93740-18-snitzer@kernel.org (mailing list archive)
State New
Headers show
Series nfs/nfsd: add support for localio | expand

Commit Message

Mike Snitzer June 19, 2024, 8:40 p.m. UTC
This document gives an overview of the LOCALIO auxiliary RPC protocol
added to the Linux NFS client and server (both v3 and v4) to allow a
client and server to reliably handshake to determine if they are on the
same host.  The LOCALIO auxiliary protocol's implementation, which uses
the same connection as NFS traffic, follows the pattern established by
the NFS ACL protocol extension.

The robust handshake between local client and server is just the
beginning, the ultimate usecase this locality makes possible is the
client is able to issue reads, writes and commits directly to the server
without having to go over the network.  This is particularly useful for
container usecases (e.g. kubernetes) where it is possible to run an IO
job local to the server.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
 include/linux/nfslocalio.h                |   2 +
 2 files changed, 150 insertions(+)
 create mode 100644 Documentation/filesystems/nfs/localio.rst

Comments

Chuck Lever June 20, 2024, 1:52 p.m. UTC | #1
On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> This document gives an overview of the LOCALIO auxiliary RPC protocol
> added to the Linux NFS client and server (both v3 and v4) to allow a
> client and server to reliably handshake to determine if they are on the
> same host.  The LOCALIO auxiliary protocol's implementation, which uses
> the same connection as NFS traffic, follows the pattern established by
> the NFS ACL protocol extension.
> 
> The robust handshake between local client and server is just the
> beginning, the ultimate usecase this locality makes possible is the
> client is able to issue reads, writes and commits directly to the server
> without having to go over the network.  This is particularly useful for
> container usecases (e.g. kubernetes) where it is possible to run an IO
> job local to the server.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
>  include/linux/nfslocalio.h                |   2 +
>  2 files changed, 150 insertions(+)
>  create mode 100644 Documentation/filesystems/nfs/localio.rst
> 
> diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> new file mode 100644
> index 000000000000..a43c3dab2cab
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/localio.rst
> @@ -0,0 +1,148 @@
> +===========
> +NFS localio
> +===========
> +
> +This document gives an overview of the LOCALIO auxiliary RPC protocol
> +added to the Linux NFS client and server (both v3 and v4) to allow a
> +client and server to reliably handshake to determine if they are on the
> +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> +the same connection as NFS traffic, follows the pattern established by
> +the NFS ACL protocol extension.
> +
> +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> +clients local to their servers.  Prior to this LOCALIO protocol a
> +fragile sockaddr network address based match against all local network
> +interfaces was attempted.  But unlike the LOCALIO protocol, the
> +sockaddr-based matching didn't handle use of iptables or containers.
> +
> +The robust handshake between local client and server is just the
> +beginning, the ultimate usecase this locality makes possible is the
> +client is able to issue reads, writes and commits directly to the server
> +without having to go over the network.  This is particularly useful for
> +container usecases (e.g. kubernetes) where it is possible to run an IO
> +job local to the server.
> +
> +The performance advantage realized from localio's ability to bypass
> +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> +-  With localio:
> +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> +-  Without localio:
> +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> +
> +RPC
> +---
> +
> +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> +method that allows the Linux nfs client to retrieve a Linux nfs server's
> +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> +to an implementation detail.
> +
> +The GETUUID method encodes the server's uuid_t in terms of the fixed
> +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> +methods are used instead of the less efficient variable sized methods.
> +
> +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> +as 0x20000002 (but a request for a unique RPC program number assignment
> +has been submitted to IANA.org).
> +
> +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> +syntax:
> +
> +#define UUID_SIZE 16
> +typedef u8 uuid_t<UUID_SIZE>;
> +
> +program NFS_LOCALIO_PROGRAM {
> +     version NULLVERS {
> +        void NULL(void) = 0;
> +	} = 1;
> +     version GETUUIDVERS {
> +        uuid_t GETUUID(void) = 1;
> +	} = 1;
> +} = 0x20000002;
> +
> +The above is the skeleton for the LOCALIO protocol, it doesn't account
> +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> +is used to implement GETUUID.
> +
> +Here are the respective XDR results for nfsd and nfs:

Hi Mike!

A protocol spec describes the on-the-wire data formats, not the
in-memory structure layouts. The below C structures are not
relevant to this specification. This should be all you need here,
if I understand your protocol correctly:

/* raw RFC 9562 UUID */
#define UUID_SIZE 16
typedef u8 uuid_t<UUID_SIZE>;

union GETUUID1res switch (uint32 status) {
case 0:
    uuid_t  uuid;
default:
    void;
};

program NFS_LOCALIO_PROGRAM {
    version LOCALIO_V1 {
        void
            NULL(void) = 0;

        GETUUID1res
            GETUUID(void) = 1;
    } = 1;
} = 0x20000002;

Then you need to discuss transport considerations:

- Whether this protocol is registered with the server's rpcbind
  service,
- Which TCP/UDP port number does it use? Assuming 2049, and that
  it will appear on the same transport connection as NFS traffic
  (just like NFACL).

Should it be supported on port 20049 with RDMA as well?


> +
> +From fs/nfsd/xdr.h
> +
> +struct nfsd_getuuidres {
> +	__be32			status;
> +	uuid_t			uuid;
> +};
> +
> +From include/linux/nfs_xdr.h
> +
> +struct nfs_getuuidres {
> +	__u8 *			uuid;
> +};
> +
> +The client ultimately decodes this XDR using:
> +xdr_stream_decode_opaque_fixed(xdr, result->uuid, UUID_SIZE);
> +
> +NFS Common and Server
> +---------------------
> +
> +First use is in nfsd, to add access to a global nfsd_uuids list in
> +nfs_common that is used to register and then identify local nfsd
> +instances.
> +
> +nfsd_uuids is protected by the nfsd_mutex or RCU read lock and is
> +composed of nfsd_uuid_t instances that are managed as nfsd creates them
> +(per network namespace).
> +
> +nfsd_uuid_is_local() and nfsd_uuid_lookup() are used to search all local
> +nfsd for the client specified nfsd uuid.
> +
> +The nfsd_uuids list is the basis for localio enablement, as such it has
> +members that point to nfsd memory for direct use by the client
> +(e.g. 'net' is the server's network namespace, through it the client can
> +access nn->nfsd_serv with proper rcu read access).  It is this client
> +and server synchronization that enables advanced usage and lifetime of
> +objects to span from the host kernel's nfsd to per-container knfsd
> +instances that are connected to nfs client's running on the same local
> +host.
> +
> +NFS Client
> +----------
> +
> +fs/nfs/localio.c:nfs_local_probe() will retrieve a server's uuid via
> +LOCALIO protocol and check if the server with that uuid is known to be
> +local.  This ensures client and server 1: support localio 2: are local
> +to each other.
> +
> +See fs/nfs/localio.c:nfs_local_open_fh() and
> +fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
> +focused use of nfsd_uuid_t struct to allow a client local to a server to
> +open a file pointer without needing to go over the network.
> +
> +The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
> +server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
> +both the nfsd network namespace and the associated nn->nfsd_serv in
> +terms of RCU.  If nfsd_open_local_fh() finds that client no longer sees
> +valid nfsd objects (be it struct net or nn->nfsd_serv) it return ENXIO
> +to nfs_local_open_fh() and the client will try to reestablish the
> +LOCALIO resources needed by calling nfs_local_probe() again.  This
> +recovery is needed if/when an nfsd instance running in a container were
> +to reboot while a localio client is connected to it.
> +
> +Testing
> +-------
> +
> +The LOCALIO auxiliary protocol and associated NFS localio read, right
> +and commit access have proven stable against various test scenarios but
> +these have not yet been formalized in any testsuite:

Is there anywhere that describes what is needed to set up clients
and a server to do local I/O? Then running the usual suite of NFS
tests on that set up and comparing the nfsstat output on the local
and remote clients should be a basic "smoke test" kind of thing
that maintainers can use as a check-in test.


> +
> +-  Client and server both on localhost (for both v3 and v4.2).
> +
> +-  Various permutations of client and server support enablement for
> +   both local and remote client and server.  Testing against NFS storage
> +   products that don't support the LOCALIO protocol was also performed.
> +
> +-  Client on host, server within a container (for both v3 and v4.2)
> +   The container testing was in terms of podman managed containers and
> +   includes container stop/restart scenario.
> diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
> index c9592ad0afe2..a9722e18b527 100644
> --- a/include/linux/nfslocalio.h
> +++ b/include/linux/nfslocalio.h
> @@ -20,6 +20,8 @@ extern struct list_head nfsd_uuids;
>   * Each nfsd instance has an nfsd_uuid_t that is accessible through the
>   * global nfsd_uuids list. Useful to allow a client to negotiate if localio
>   * possible with its server.
> + *
> + * See Documentation/filesystems/nfs/localio.rst for more detail.
>   */
>  typedef struct {
>  	uuid_t uuid;
> -- 
> 2.44.0
>
Mike Snitzer June 20, 2024, 2:33 p.m. UTC | #2
On Thu, Jun 20, 2024 at 09:52:21AM -0400, Chuck Lever wrote:
> On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > added to the Linux NFS client and server (both v3 and v4) to allow a
> > client and server to reliably handshake to determine if they are on the
> > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > the same connection as NFS traffic, follows the pattern established by
> > the NFS ACL protocol extension.
> > 
> > The robust handshake between local client and server is just the
> > beginning, the ultimate usecase this locality makes possible is the
> > client is able to issue reads, writes and commits directly to the server
> > without having to go over the network.  This is particularly useful for
> > container usecases (e.g. kubernetes) where it is possible to run an IO
> > job local to the server.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> >  include/linux/nfslocalio.h                |   2 +
> >  2 files changed, 150 insertions(+)
> >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > 
> > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > new file mode 100644
> > index 000000000000..a43c3dab2cab
> > --- /dev/null
> > +++ b/Documentation/filesystems/nfs/localio.rst
> > @@ -0,0 +1,148 @@
> > +===========
> > +NFS localio
> > +===========
> > +
> > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > +client and server to reliably handshake to determine if they are on the
> > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > +the same connection as NFS traffic, follows the pattern established by
> > +the NFS ACL protocol extension.
> > +
> > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > +clients local to their servers.  Prior to this LOCALIO protocol a
> > +fragile sockaddr network address based match against all local network
> > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > +sockaddr-based matching didn't handle use of iptables or containers.
> > +
> > +The robust handshake between local client and server is just the
> > +beginning, the ultimate usecase this locality makes possible is the
> > +client is able to issue reads, writes and commits directly to the server
> > +without having to go over the network.  This is particularly useful for
> > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > +job local to the server.
> > +
> > +The performance advantage realized from localio's ability to bypass
> > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > +-  With localio:
> > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > +-  Without localio:
> > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > +
> > +RPC
> > +---
> > +
> > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > +to an implementation detail.
> > +
> > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > +methods are used instead of the less efficient variable sized methods.
> > +
> > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > +as 0x20000002 (but a request for a unique RPC program number assignment
> > +has been submitted to IANA.org).
> > +
> > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > +syntax:
> > +
> > +#define UUID_SIZE 16
> > +typedef u8 uuid_t<UUID_SIZE>;
> > +
> > +program NFS_LOCALIO_PROGRAM {
> > +     version NULLVERS {
> > +        void NULL(void) = 0;
> > +	} = 1;
> > +     version GETUUIDVERS {
> > +        uuid_t GETUUID(void) = 1;
> > +	} = 1;
> > +} = 0x20000002;
> > +
> > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > +is used to implement GETUUID.
> > +
> > +Here are the respective XDR results for nfsd and nfs:
> 
> Hi Mike!
> 
> A protocol spec describes the on-the-wire data formats, not the
> in-memory structure layouts. The below C structures are not
> relevant to this specification. This should be all you need here,
> if I understand your protocol correctly:
> 
> /* raw RFC 9562 UUID */
> #define UUID_SIZE 16
> typedef u8 uuid_t<UUID_SIZE>;
> 
> union GETUUID1res switch (uint32 status) {
> case 0:
>     uuid_t  uuid;
> default:
>     void;
> };
> 
> program NFS_LOCALIO_PROGRAM {
>     version LOCALIO_V1 {
>         void
>             NULL(void) = 0;
> 
>         GETUUID1res
>             GETUUID(void) = 1;
>     } = 1;
> } = 0x20000002;

Thanks for this, nice to see I wasn't too far off.

> Then you need to discuss transport considerations:
> 
> - Whether this protocol is registered with the server's rpcbind
>   service,

It isn't, should it be?  Not familiar with what needs updating to do
it, but happy to work through it.

> - Which TCP/UDP port number does it use? Assuming 2049, and that
>   it will appear on the same transport connection as NFS traffic
>   (just like NFACL).

Correct.
 
> Should it be supported on port 20049 with RDMA as well?

Unless there is some additional code needed, I don't see why it
wouldn't.  But I haven't tested it (will look at NFS's RDMA support
and wrap my head around it).

> > +Testing
> > +-------
> > +
> > +The LOCALIO auxiliary protocol and associated NFS localio read, right
> > +and commit access have proven stable against various test scenarios but
> > +these have not yet been formalized in any testsuite:
> 
> Is there anywhere that describes what is needed to set up clients
> and a server to do local I/O? Then running the usual suite of NFS
> tests on that set up and comparing the nfsstat output on the local
> and remote clients should be a basic "smoke test" kind of thing
> that maintainers can use as a check-in test.

I just figured running nfsd and nfs client connecting to that
localhost was obvious.  But I can fill in more howto like info in this
section.

What is "the usual suite of NFS tests"?  I should run them ;)

(apologies if there is well established docs with pointers, still
learning to fish, thanks for your help!)

Mike
Chuck Lever June 20, 2024, 2:45 p.m. UTC | #3
On Thu, Jun 20, 2024 at 10:33:15AM -0400, Mike Snitzer wrote:
> On Thu, Jun 20, 2024 at 09:52:21AM -0400, Chuck Lever wrote:
> > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > client and server to reliably handshake to determine if they are on the
> > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > the same connection as NFS traffic, follows the pattern established by
> > > the NFS ACL protocol extension.
> > > 
> > > The robust handshake between local client and server is just the
> > > beginning, the ultimate usecase this locality makes possible is the
> > > client is able to issue reads, writes and commits directly to the server
> > > without having to go over the network.  This is particularly useful for
> > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > job local to the server.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > >  include/linux/nfslocalio.h                |   2 +
> > >  2 files changed, 150 insertions(+)
> > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > 
> > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > new file mode 100644
> > > index 000000000000..a43c3dab2cab
> > > --- /dev/null
> > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > @@ -0,0 +1,148 @@
> > > +===========
> > > +NFS localio
> > > +===========
> > > +
> > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > +client and server to reliably handshake to determine if they are on the
> > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > +the same connection as NFS traffic, follows the pattern established by
> > > +the NFS ACL protocol extension.
> > > +
> > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > +fragile sockaddr network address based match against all local network
> > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > +
> > > +The robust handshake between local client and server is just the
> > > +beginning, the ultimate usecase this locality makes possible is the
> > > +client is able to issue reads, writes and commits directly to the server
> > > +without having to go over the network.  This is particularly useful for
> > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > +job local to the server.
> > > +
> > > +The performance advantage realized from localio's ability to bypass
> > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > +-  With localio:
> > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > +-  Without localio:
> > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > +
> > > +RPC
> > > +---
> > > +
> > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > +to an implementation detail.
> > > +
> > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > +methods are used instead of the less efficient variable sized methods.
> > > +
> > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > +has been submitted to IANA.org).
> > > +
> > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > +syntax:
> > > +
> > > +#define UUID_SIZE 16
> > > +typedef u8 uuid_t<UUID_SIZE>;
> > > +
> > > +program NFS_LOCALIO_PROGRAM {
> > > +     version NULLVERS {
> > > +        void NULL(void) = 0;
> > > +	} = 1;
> > > +     version GETUUIDVERS {
> > > +        uuid_t GETUUID(void) = 1;
> > > +	} = 1;
> > > +} = 0x20000002;
> > > +
> > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > +is used to implement GETUUID.
> > > +
> > > +Here are the respective XDR results for nfsd and nfs:
> > 
> > Hi Mike!
> > 
> > A protocol spec describes the on-the-wire data formats, not the
> > in-memory structure layouts. The below C structures are not
> > relevant to this specification. This should be all you need here,
> > if I understand your protocol correctly:
> > 
> > /* raw RFC 9562 UUID */
> > #define UUID_SIZE 16
> > typedef u8 uuid_t<UUID_SIZE>;
> > 
> > union GETUUID1res switch (uint32 status) {
> > case 0:
> >     uuid_t  uuid;
> > default:
> >     void;
> > };
> > 
> > program NFS_LOCALIO_PROGRAM {
> >     version LOCALIO_V1 {
> >         void
> >             NULL(void) = 0;
> > 
> >         GETUUID1res
> >             GETUUID(void) = 1;
> >     } = 1;
> > } = 0x20000002;
> 
> Thanks for this, nice to see I wasn't too far off.
> 
> > Then you need to discuss transport considerations:
> > 
> > - Whether this protocol is registered with the server's rpcbind
> >   service,
> 
> It isn't, should it be?  Not familiar with what needs updating to do
> it, but happy to work through it.

Well the issue is whether a client can assume that LOCALIO will
always be running on a fixed port. Which, IIUC, it will be. So I
don't think registration is needed. The protocol spec needs to
state that the LOCALIO server port is fixed, and that makes
rpcbind registration optional.


> > - Which TCP/UDP port number does it use? Assuming 2049, and that
> >   it will appear on the same transport connection as NFS traffic
> >   (just like NFACL).
> 
> Correct.
>  
> > Should it be supported on port 20049 with RDMA as well?
> 
> Unless there is some additional code needed, I don't see why it
> wouldn't.  But I haven't tested it (will look at NFS's RDMA support
> and wrap my head around it).

Head-wrapping NFS/RDMA is a multi-year project :-) 

You probably do want to have LOCALIO available for NFS/RDMA
connections. I'm not sure that requires extra code. I don't recall
clearly, but I think there isn't anything extra done for NFSACL,
for example.


> > > +Testing
> > > +-------
> > > +
> > > +The LOCALIO auxiliary protocol and associated NFS localio read, right
> > > +and commit access have proven stable against various test scenarios but
> > > +these have not yet been formalized in any testsuite:
> > 
> > Is there anywhere that describes what is needed to set up clients
> > and a server to do local I/O? Then running the usual suite of NFS
> > tests on that set up and comparing the nfsstat output on the local
> > and remote clients should be a basic "smoke test" kind of thing
> > that maintainers can use as a check-in test.
> 
> I just figured running nfsd and nfs client connecting to that
> localhost was obvious.  But I can fill in more howto like info in this
> section.
> 
> What is "the usual suite of NFS tests"?  I should run them ;)

Start with the cthon04 suite. We all seem to use fstests too. There
are some others, but these should be sufficient for your purposes.
NeilBrown June 20, 2024, 10:12 p.m. UTC | #4
On Thu, 20 Jun 2024, Chuck Lever wrote:
> On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > added to the Linux NFS client and server (both v3 and v4) to allow a
> > client and server to reliably handshake to determine if they are on the
> > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > the same connection as NFS traffic, follows the pattern established by
> > the NFS ACL protocol extension.
> > 
> > The robust handshake between local client and server is just the
> > beginning, the ultimate usecase this locality makes possible is the
> > client is able to issue reads, writes and commits directly to the server
> > without having to go over the network.  This is particularly useful for
> > container usecases (e.g. kubernetes) where it is possible to run an IO
> > job local to the server.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> >  include/linux/nfslocalio.h                |   2 +
> >  2 files changed, 150 insertions(+)
> >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > 
> > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > new file mode 100644
> > index 000000000000..a43c3dab2cab
> > --- /dev/null
> > +++ b/Documentation/filesystems/nfs/localio.rst
> > @@ -0,0 +1,148 @@
> > +===========
> > +NFS localio
> > +===========
> > +
> > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > +client and server to reliably handshake to determine if they are on the
> > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > +the same connection as NFS traffic, follows the pattern established by
> > +the NFS ACL protocol extension.
> > +
> > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > +clients local to their servers.  Prior to this LOCALIO protocol a
> > +fragile sockaddr network address based match against all local network
> > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > +sockaddr-based matching didn't handle use of iptables or containers.
> > +
> > +The robust handshake between local client and server is just the
> > +beginning, the ultimate usecase this locality makes possible is the
> > +client is able to issue reads, writes and commits directly to the server
> > +without having to go over the network.  This is particularly useful for
> > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > +job local to the server.
> > +
> > +The performance advantage realized from localio's ability to bypass
> > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > +-  With localio:
> > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > +-  Without localio:
> > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > +
> > +RPC
> > +---
> > +
> > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > +to an implementation detail.
> > +
> > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > +methods are used instead of the less efficient variable sized methods.
> > +
> > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > +as 0x20000002 (but a request for a unique RPC program number assignment
> > +has been submitted to IANA.org).
> > +
> > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > +syntax:
> > +
> > +#define UUID_SIZE 16
> > +typedef u8 uuid_t<UUID_SIZE>;
> > +
> > +program NFS_LOCALIO_PROGRAM {
> > +     version NULLVERS {
> > +        void NULL(void) = 0;
> > +	} = 1;
> > +     version GETUUIDVERS {
> > +        uuid_t GETUUID(void) = 1;
> > +	} = 1;
> > +} = 0x20000002;
> > +
> > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > +is used to implement GETUUID.
> > +
> > +Here are the respective XDR results for nfsd and nfs:
> 
> Hi Mike!
> 
> A protocol spec describes the on-the-wire data formats, not the
> in-memory structure layouts. The below C structures are not
> relevant to this specification. This should be all you need here,
> if I understand your protocol correctly:
> 
> /* raw RFC 9562 UUID */
> #define UUID_SIZE 16
> typedef u8 uuid_t<UUID_SIZE>;
> 
> union GETUUID1res switch (uint32 status) {

I don't think we need a status in the protocol.  GETUUID always returns
a UUID.  There is no possible error condition.

I don't think we need the NULL procedure either, but that is such a
deeply entrenched practice I won't argue the point.

> case 0:
>     uuid_t  uuid;
> default:
>     void;
> };
> 
> program NFS_LOCALIO_PROGRAM {
>     version LOCALIO_V1 {
>         void
>             NULL(void) = 0;
> 
>         GETUUID1res
>             GETUUID(void) = 1;
>     } = 1;
> } = 0x20000002;
> 
> Then you need to discuss transport considerations:
> 
> - Whether this protocol is registered with the server's rpcbind
>   service,
> - Which TCP/UDP port number does it use? Assuming 2049, and that
>   it will appear on the same transport connection as NFS traffic
>   (just like NFACL).
> 
> Should it be supported on port 20049 with RDMA as well?

I don't think we should be that explicit.  We should way that requests
are sent to the same destination as the associated NFS requests.  No
mention of transports or addresses or ports.

NeilBrown


> 
> 
> > +
> > +From fs/nfsd/xdr.h
> > +
> > +struct nfsd_getuuidres {
> > +	__be32			status;
> > +	uuid_t			uuid;
> > +};
> > +
> > +From include/linux/nfs_xdr.h
> > +
> > +struct nfs_getuuidres {
> > +	__u8 *			uuid;
> > +};
> > +
> > +The client ultimately decodes this XDR using:
> > +xdr_stream_decode_opaque_fixed(xdr, result->uuid, UUID_SIZE);
> > +
> > +NFS Common and Server
> > +---------------------
> > +
> > +First use is in nfsd, to add access to a global nfsd_uuids list in
> > +nfs_common that is used to register and then identify local nfsd
> > +instances.
> > +
> > +nfsd_uuids is protected by the nfsd_mutex or RCU read lock and is
> > +composed of nfsd_uuid_t instances that are managed as nfsd creates them
> > +(per network namespace).
> > +
> > +nfsd_uuid_is_local() and nfsd_uuid_lookup() are used to search all local
> > +nfsd for the client specified nfsd uuid.
> > +
> > +The nfsd_uuids list is the basis for localio enablement, as such it has
> > +members that point to nfsd memory for direct use by the client
> > +(e.g. 'net' is the server's network namespace, through it the client can
> > +access nn->nfsd_serv with proper rcu read access).  It is this client
> > +and server synchronization that enables advanced usage and lifetime of
> > +objects to span from the host kernel's nfsd to per-container knfsd
> > +instances that are connected to nfs client's running on the same local
> > +host.
> > +
> > +NFS Client
> > +----------
> > +
> > +fs/nfs/localio.c:nfs_local_probe() will retrieve a server's uuid via
> > +LOCALIO protocol and check if the server with that uuid is known to be
> > +local.  This ensures client and server 1: support localio 2: are local
> > +to each other.
> > +
> > +See fs/nfs/localio.c:nfs_local_open_fh() and
> > +fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
> > +focused use of nfsd_uuid_t struct to allow a client local to a server to
> > +open a file pointer without needing to go over the network.
> > +
> > +The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
> > +server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
> > +both the nfsd network namespace and the associated nn->nfsd_serv in
> > +terms of RCU.  If nfsd_open_local_fh() finds that client no longer sees
> > +valid nfsd objects (be it struct net or nn->nfsd_serv) it return ENXIO
> > +to nfs_local_open_fh() and the client will try to reestablish the
> > +LOCALIO resources needed by calling nfs_local_probe() again.  This
> > +recovery is needed if/when an nfsd instance running in a container were
> > +to reboot while a localio client is connected to it.
> > +
> > +Testing
> > +-------
> > +
> > +The LOCALIO auxiliary protocol and associated NFS localio read, right
> > +and commit access have proven stable against various test scenarios but
> > +these have not yet been formalized in any testsuite:
> 
> Is there anywhere that describes what is needed to set up clients
> and a server to do local I/O? Then running the usual suite of NFS
> tests on that set up and comparing the nfsstat output on the local
> and remote clients should be a basic "smoke test" kind of thing
> that maintainers can use as a check-in test.
> 
> 
> > +
> > +-  Client and server both on localhost (for both v3 and v4.2).
> > +
> > +-  Various permutations of client and server support enablement for
> > +   both local and remote client and server.  Testing against NFS storage
> > +   products that don't support the LOCALIO protocol was also performed.
> > +
> > +-  Client on host, server within a container (for both v3 and v4.2)
> > +   The container testing was in terms of podman managed containers and
> > +   includes container stop/restart scenario.
> > diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
> > index c9592ad0afe2..a9722e18b527 100644
> > --- a/include/linux/nfslocalio.h
> > +++ b/include/linux/nfslocalio.h
> > @@ -20,6 +20,8 @@ extern struct list_head nfsd_uuids;
> >   * Each nfsd instance has an nfsd_uuid_t that is accessible through the
> >   * global nfsd_uuids list. Useful to allow a client to negotiate if localio
> >   * possible with its server.
> > + *
> > + * See Documentation/filesystems/nfs/localio.rst for more detail.
> >   */
> >  typedef struct {
> >  	uuid_t uuid;
> > -- 
> > 2.44.0
> > 
> 
> -- 
> Chuck Lever
>
Mike Snitzer June 20, 2024, 10:35 p.m. UTC | #5
On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> On Thu, 20 Jun 2024, Chuck Lever wrote:
> > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > client and server to reliably handshake to determine if they are on the
> > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > the same connection as NFS traffic, follows the pattern established by
> > > the NFS ACL protocol extension.
> > > 
> > > The robust handshake between local client and server is just the
> > > beginning, the ultimate usecase this locality makes possible is the
> > > client is able to issue reads, writes and commits directly to the server
> > > without having to go over the network.  This is particularly useful for
> > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > job local to the server.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > >  include/linux/nfslocalio.h                |   2 +
> > >  2 files changed, 150 insertions(+)
> > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > 
> > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > new file mode 100644
> > > index 000000000000..a43c3dab2cab
> > > --- /dev/null
> > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > @@ -0,0 +1,148 @@
> > > +===========
> > > +NFS localio
> > > +===========
> > > +
> > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > +client and server to reliably handshake to determine if they are on the
> > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > +the same connection as NFS traffic, follows the pattern established by
> > > +the NFS ACL protocol extension.
> > > +
> > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > +fragile sockaddr network address based match against all local network
> > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > +
> > > +The robust handshake between local client and server is just the
> > > +beginning, the ultimate usecase this locality makes possible is the
> > > +client is able to issue reads, writes and commits directly to the server
> > > +without having to go over the network.  This is particularly useful for
> > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > +job local to the server.
> > > +
> > > +The performance advantage realized from localio's ability to bypass
> > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > +-  With localio:
> > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > +-  Without localio:
> > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > +
> > > +RPC
> > > +---
> > > +
> > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > +to an implementation detail.
> > > +
> > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > +methods are used instead of the less efficient variable sized methods.
> > > +
> > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > +has been submitted to IANA.org).
> > > +
> > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > +syntax:
> > > +
> > > +#define UUID_SIZE 16
> > > +typedef u8 uuid_t<UUID_SIZE>;
> > > +
> > > +program NFS_LOCALIO_PROGRAM {
> > > +     version NULLVERS {
> > > +        void NULL(void) = 0;
> > > +	} = 1;
> > > +     version GETUUIDVERS {
> > > +        uuid_t GETUUID(void) = 1;
> > > +	} = 1;
> > > +} = 0x20000002;
> > > +
> > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > +is used to implement GETUUID.
> > > +
> > > +Here are the respective XDR results for nfsd and nfs:
> > 
> > Hi Mike!
> > 
> > A protocol spec describes the on-the-wire data formats, not the
> > in-memory structure layouts. The below C structures are not
> > relevant to this specification. This should be all you need here,
> > if I understand your protocol correctly:
> > 
> > /* raw RFC 9562 UUID */
> > #define UUID_SIZE 16
> > typedef u8 uuid_t<UUID_SIZE>;
> > 
> > union GETUUID1res switch (uint32 status) {
> 
> I don't think we need a status in the protocol.  GETUUID always returns
> a UUID.  There is no possible error condition.

By having localio use NFS's XDR we're able to piggyback on a status
being returned by standard NFS RPC handling.

See:
nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
FIXME comment about abusing nfs_opnum4).

Not sure how the reality of piggbacking on NFS XDR should be captured
in the protocol spec here.

> I don't think we need the NULL procedure either, but that is such a
> deeply entrenched practice I won't argue the point.

The code required it be there, I unfortunately don't recall specifics
on what failed if it wasn't there (may have been rpc_bind_new_program).

> > case 0:
> >     uuid_t  uuid;
> > default:
> >     void;
> > };
> > 
> > program NFS_LOCALIO_PROGRAM {
> >     version LOCALIO_V1 {
> >         void
> >             NULL(void) = 0;
> > 
> >         GETUUID1res
> >             GETUUID(void) = 1;
> >     } = 1;
> > } = 0x20000002;
> > 
> > Then you need to discuss transport considerations:
> > 
> > - Whether this protocol is registered with the server's rpcbind
> >   service,
> > - Which TCP/UDP port number does it use? Assuming 2049, and that
> >   it will appear on the same transport connection as NFS traffic
> >   (just like NFACL).
> > 
> > Should it be supported on port 20049 with RDMA as well?
> 
> I don't think we should be that explicit.  We should way that requests
> are sent to the same destination as the associated NFS requests.  No
> mention of transports or addresses or ports.

OK, I agree.

Thanks,
Mike
Chuck Lever June 20, 2024, 11:28 p.m. UTC | #6
On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
> On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> > On Thu, 20 Jun 2024, Chuck Lever wrote:
> > > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > client and server to reliably handshake to determine if they are on the
> > > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > the same connection as NFS traffic, follows the pattern established by
> > > > the NFS ACL protocol extension.
> > > > 
> > > > The robust handshake between local client and server is just the
> > > > beginning, the ultimate usecase this locality makes possible is the
> > > > client is able to issue reads, writes and commits directly to the server
> > > > without having to go over the network.  This is particularly useful for
> > > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > job local to the server.
> > > > 
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > ---
> > > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > > >  include/linux/nfslocalio.h                |   2 +
> > > >  2 files changed, 150 insertions(+)
> > > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > > 
> > > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > > new file mode 100644
> > > > index 000000000000..a43c3dab2cab
> > > > --- /dev/null
> > > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > > @@ -0,0 +1,148 @@
> > > > +===========
> > > > +NFS localio
> > > > +===========
> > > > +
> > > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > +client and server to reliably handshake to determine if they are on the
> > > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > +the same connection as NFS traffic, follows the pattern established by
> > > > +the NFS ACL protocol extension.
> > > > +
> > > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > > +fragile sockaddr network address based match against all local network
> > > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > > +
> > > > +The robust handshake between local client and server is just the
> > > > +beginning, the ultimate usecase this locality makes possible is the
> > > > +client is able to issue reads, writes and commits directly to the server
> > > > +without having to go over the network.  This is particularly useful for
> > > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > +job local to the server.
> > > > +
> > > > +The performance advantage realized from localio's ability to bypass
> > > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > > +-  With localio:
> > > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > > +-  Without localio:
> > > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > > +
> > > > +RPC
> > > > +---
> > > > +
> > > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > > +to an implementation detail.
> > > > +
> > > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > > +methods are used instead of the less efficient variable sized methods.
> > > > +
> > > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > > +has been submitted to IANA.org).
> > > > +
> > > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > > +syntax:
> > > > +
> > > > +#define UUID_SIZE 16
> > > > +typedef u8 uuid_t<UUID_SIZE>;
> > > > +
> > > > +program NFS_LOCALIO_PROGRAM {
> > > > +     version NULLVERS {
> > > > +        void NULL(void) = 0;
> > > > +	} = 1;
> > > > +     version GETUUIDVERS {
> > > > +        uuid_t GETUUID(void) = 1;
> > > > +	} = 1;
> > > > +} = 0x20000002;
> > > > +
> > > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > > +is used to implement GETUUID.
> > > > +
> > > > +Here are the respective XDR results for nfsd and nfs:
> > > 
> > > Hi Mike!
> > > 
> > > A protocol spec describes the on-the-wire data formats, not the
> > > in-memory structure layouts. The below C structures are not
> > > relevant to this specification. This should be all you need here,
> > > if I understand your protocol correctly:
> > > 
> > > /* raw RFC 9562 UUID */
> > > #define UUID_SIZE 16
> > > typedef u8 uuid_t<UUID_SIZE>;
> > > 
> > > union GETUUID1res switch (uint32 status) {
> > 
> > I don't think we need a status in the protocol.  GETUUID always returns
> > a UUID.  There is no possible error condition.
> 
> By having localio use NFS's XDR we're able to piggyback on a status
> being returned by standard NFS RPC handling.
> 
> See:
> nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
> nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
> FIXME comment about abusing nfs_opnum4).

No, let's not piggyback like that. Please make it a separate
XDR implementation just like NFSACL is. Again, LOCALIO is not
an extension of the NFS protocol. Making that claim confuses
people for whom the term "extension" has a very precise meaning.
If we were extending NFS, then yes, adding the new procedures
to the NFS XDR implementation is appropriate, but that's not
what you are doing: you are adding a new side-band protocol.

I have a long-term goal to ensure we can generate the source
code of the XDR layer of all the kernel RPC protocols via an
rpcgen like tool. A code generator can ensure that the
marshalling and unmarshalling code is memory-safe.

By piggybacking, you are building LOCALIO into another
protocol's XDR implementation, which makes it a special case,
and thus harder to implement via code that is generated
automatically from unmodified XDR language specs.

Maybe the client side maintainers don't care about that, but
please don't piggyback LOCALIO onto the NFSD's NFS XDR
implementation.

Then, if it's a separate implementation, you can remove the status
code. I was wondering why the server would reply with an error. If
LOCALIO/GETUUID is not supported, then an RPC level error occurs
anyway.

If you think you need an error like "Yes, I recognize that program
and procedure, but this file system doesn't allow local access
in any case" then that needs to be added to the protocol XDR
specification.


> Not sure how the reality of piggbacking on NFS XDR should be captured
> in the protocol spec here.

It's an implementation choice, so it doesn't belong in the protocol
spec that, again, lays out only on-the-wire behavior,.

Implementation specifics can be discussed in another section of the
document.


> > I don't think we need the NULL procedure either, but that is such a
> > deeply entrenched practice I won't argue the point.
> 
> The code required it be there, I unfortunately don't recall specifics
> on what failed if it wasn't there (may have been rpc_bind_new_program).

Please leave the NULL procedure where it is. Note that the NFS
program's NULL procedure is used in two rather significant ways:

 - As a carrier for GSSAPI messages
 - As a probe for the RPC-with-TLS capability

The latter might be significant if a client sends a LOCALIO
operation as the first RPC procedure when contacting an unfamiliar
server -- if it wants TLS protection for that procedure, then it
will need to send a NULL(RPC_AUTH_TLS) as the very first RPC
transaction.

Since LOCALIO/GETUUID is going over the network sometimes, it's
reasonable to expect that these security measures could be used.


> > > case 0:
> > >     uuid_t  uuid;
> > > default:
> > >     void;
> > > };
> > > 
> > > program NFS_LOCALIO_PROGRAM {
> > >     version LOCALIO_V1 {
> > >         void
> > >             NULL(void) = 0;
> > > 
> > >         GETUUID1res
> > >             GETUUID(void) = 1;
> > >     } = 1;
> > > } = 0x20000002;
> > > 
> > > Then you need to discuss transport considerations:
> > > 
> > > - Whether this protocol is registered with the server's rpcbind
> > >   service,
> > > - Which TCP/UDP port number does it use? Assuming 2049, and that
> > >   it will appear on the same transport connection as NFS traffic
> > >   (just like NFACL).
> > > 
> > > Should it be supported on port 20049 with RDMA as well?
> > 
> > I don't think we should be that explicit.  We should way that requests
> > are sent to the same destination as the associated NFS requests.  No
> > mention of transports or addresses or ports.
> 
> OK, I agree.
NeilBrown June 20, 2024, 11:42 p.m. UTC | #7
On Fri, 21 Jun 2024, Chuck Lever wrote:
> On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
> > On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> > > On Thu, 20 Jun 2024, Chuck Lever wrote:
> > > > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > client and server to reliably handshake to determine if they are on the
> > > > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > the same connection as NFS traffic, follows the pattern established by
> > > > > the NFS ACL protocol extension.
> > > > > 
> > > > > The robust handshake between local client and server is just the
> > > > > beginning, the ultimate usecase this locality makes possible is the
> > > > > client is able to issue reads, writes and commits directly to the server
> > > > > without having to go over the network.  This is particularly useful for
> > > > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > job local to the server.
> > > > > 
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > > > >  include/linux/nfslocalio.h                |   2 +
> > > > >  2 files changed, 150 insertions(+)
> > > > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > > > 
> > > > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > > > new file mode 100644
> > > > > index 000000000000..a43c3dab2cab
> > > > > --- /dev/null
> > > > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > > > @@ -0,0 +1,148 @@
> > > > > +===========
> > > > > +NFS localio
> > > > > +===========
> > > > > +
> > > > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > +client and server to reliably handshake to determine if they are on the
> > > > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > +the same connection as NFS traffic, follows the pattern established by
> > > > > +the NFS ACL protocol extension.
> > > > > +
> > > > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > > > +fragile sockaddr network address based match against all local network
> > > > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > > > +
> > > > > +The robust handshake between local client and server is just the
> > > > > +beginning, the ultimate usecase this locality makes possible is the
> > > > > +client is able to issue reads, writes and commits directly to the server
> > > > > +without having to go over the network.  This is particularly useful for
> > > > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > +job local to the server.
> > > > > +
> > > > > +The performance advantage realized from localio's ability to bypass
> > > > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > > > +-  With localio:
> > > > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > > > +-  Without localio:
> > > > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > > > +
> > > > > +RPC
> > > > > +---
> > > > > +
> > > > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > > > +to an implementation detail.
> > > > > +
> > > > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > > > +methods are used instead of the less efficient variable sized methods.
> > > > > +
> > > > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > > > +has been submitted to IANA.org).
> > > > > +
> > > > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > > > +syntax:
> > > > > +
> > > > > +#define UUID_SIZE 16
> > > > > +typedef u8 uuid_t<UUID_SIZE>;
> > > > > +
> > > > > +program NFS_LOCALIO_PROGRAM {
> > > > > +     version NULLVERS {
> > > > > +        void NULL(void) = 0;
> > > > > +	} = 1;
> > > > > +     version GETUUIDVERS {
> > > > > +        uuid_t GETUUID(void) = 1;
> > > > > +	} = 1;
> > > > > +} = 0x20000002;
> > > > > +
> > > > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > > > +is used to implement GETUUID.
> > > > > +
> > > > > +Here are the respective XDR results for nfsd and nfs:
> > > > 
> > > > Hi Mike!
> > > > 
> > > > A protocol spec describes the on-the-wire data formats, not the
> > > > in-memory structure layouts. The below C structures are not
> > > > relevant to this specification. This should be all you need here,
> > > > if I understand your protocol correctly:
> > > > 
> > > > /* raw RFC 9562 UUID */
> > > > #define UUID_SIZE 16
> > > > typedef u8 uuid_t<UUID_SIZE>;
> > > > 
> > > > union GETUUID1res switch (uint32 status) {
> > > 
> > > I don't think we need a status in the protocol.  GETUUID always returns
> > > a UUID.  There is no possible error condition.
> > 
> > By having localio use NFS's XDR we're able to piggyback on a status
> > being returned by standard NFS RPC handling.
> > 
> > See:
> > nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
> > nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
> > FIXME comment about abusing nfs_opnum4).
> 
> No, let's not piggyback like that. Please make it a separate
> XDR implementation just like NFSACL is. Again, LOCALIO is not
> an extension of the NFS protocol. Making that claim confuses
> people for whom the term "extension" has a very precise meaning.
> If we were extending NFS, then yes, adding the new procedures
> to the NFS XDR implementation is appropriate, but that's not
> what you are doing: you are adding a new side-band protocol.

I'm currently working through the LOCALIO protocol code to make it a
single version rather than '3' and '4'.  In the process I'm making it
completely separate from the NFS protocol implementation and cleaning up
some other bits.  e.g. it shouldn't register with rpcbind.

I'll hopefully post patches in a few hours.  I writing this now to
discourage Mike from starting work on this.

NeilBrown


> 
> I have a long-term goal to ensure we can generate the source
> code of the XDR layer of all the kernel RPC protocols via an
> rpcgen like tool. A code generator can ensure that the
> marshalling and unmarshalling code is memory-safe.
> 
> By piggybacking, you are building LOCALIO into another
> protocol's XDR implementation, which makes it a special case,
> and thus harder to implement via code that is generated
> automatically from unmodified XDR language specs.
> 
> Maybe the client side maintainers don't care about that, but
> please don't piggyback LOCALIO onto the NFSD's NFS XDR
> implementation.
> 
> Then, if it's a separate implementation, you can remove the status
> code. I was wondering why the server would reply with an error. If
> LOCALIO/GETUUID is not supported, then an RPC level error occurs
> anyway.
> 
> If you think you need an error like "Yes, I recognize that program
> and procedure, but this file system doesn't allow local access
> in any case" then that needs to be added to the protocol XDR
> specification.
> 
> 
> > Not sure how the reality of piggbacking on NFS XDR should be captured
> > in the protocol spec here.
> 
> It's an implementation choice, so it doesn't belong in the protocol
> spec that, again, lays out only on-the-wire behavior,.
> 
> Implementation specifics can be discussed in another section of the
> document.
> 
> 
> > > I don't think we need the NULL procedure either, but that is such a
> > > deeply entrenched practice I won't argue the point.
> > 
> > The code required it be there, I unfortunately don't recall specifics
> > on what failed if it wasn't there (may have been rpc_bind_new_program).
> 
> Please leave the NULL procedure where it is. Note that the NFS
> program's NULL procedure is used in two rather significant ways:
> 
>  - As a carrier for GSSAPI messages
>  - As a probe for the RPC-with-TLS capability
> 
> The latter might be significant if a client sends a LOCALIO
> operation as the first RPC procedure when contacting an unfamiliar
> server -- if it wants TLS protection for that procedure, then it
> will need to send a NULL(RPC_AUTH_TLS) as the very first RPC
> transaction.
> 
> Since LOCALIO/GETUUID is going over the network sometimes, it's
> reasonable to expect that these security measures could be used.
> 
> 
> > > > case 0:
> > > >     uuid_t  uuid;
> > > > default:
> > > >     void;
> > > > };
> > > > 
> > > > program NFS_LOCALIO_PROGRAM {
> > > >     version LOCALIO_V1 {
> > > >         void
> > > >             NULL(void) = 0;
> > > > 
> > > >         GETUUID1res
> > > >             GETUUID(void) = 1;
> > > >     } = 1;
> > > > } = 0x20000002;
> > > > 
> > > > Then you need to discuss transport considerations:
> > > > 
> > > > - Whether this protocol is registered with the server's rpcbind
> > > >   service,
> > > > - Which TCP/UDP port number does it use? Assuming 2049, and that
> > > >   it will appear on the same transport connection as NFS traffic
> > > >   (just like NFACL).
> > > > 
> > > > Should it be supported on port 20049 with RDMA as well?
> > > 
> > > I don't think we should be that explicit.  We should way that requests
> > > are sent to the same destination as the associated NFS requests.  No
> > > mention of transports or addresses or ports.
> > 
> > OK, I agree.
> 
> 
> -- 
> Chuck Lever
>
Mike Snitzer June 21, 2024, 12:28 a.m. UTC | #8
On Thu, Jun 20, 2024 at 07:28:14PM -0400, Chuck Lever wrote:
> On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
> > On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> > > On Thu, 20 Jun 2024, Chuck Lever wrote:
> > > > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > client and server to reliably handshake to determine if they are on the
> > > > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > the same connection as NFS traffic, follows the pattern established by
> > > > > the NFS ACL protocol extension.
> > > > > 
> > > > > The robust handshake between local client and server is just the
> > > > > beginning, the ultimate usecase this locality makes possible is the
> > > > > client is able to issue reads, writes and commits directly to the server
> > > > > without having to go over the network.  This is particularly useful for
> > > > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > job local to the server.
> > > > > 
> > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > ---
> > > > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > > > >  include/linux/nfslocalio.h                |   2 +
> > > > >  2 files changed, 150 insertions(+)
> > > > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > > > 
> > > > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > > > new file mode 100644
> > > > > index 000000000000..a43c3dab2cab
> > > > > --- /dev/null
> > > > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > > > @@ -0,0 +1,148 @@
> > > > > +===========
> > > > > +NFS localio
> > > > > +===========
> > > > > +
> > > > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > +client and server to reliably handshake to determine if they are on the
> > > > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > +the same connection as NFS traffic, follows the pattern established by
> > > > > +the NFS ACL protocol extension.
> > > > > +
> > > > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > > > +fragile sockaddr network address based match against all local network
> > > > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > > > +
> > > > > +The robust handshake between local client and server is just the
> > > > > +beginning, the ultimate usecase this locality makes possible is the
> > > > > +client is able to issue reads, writes and commits directly to the server
> > > > > +without having to go over the network.  This is particularly useful for
> > > > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > +job local to the server.
> > > > > +
> > > > > +The performance advantage realized from localio's ability to bypass
> > > > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > > > +-  With localio:
> > > > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > > > +-  Without localio:
> > > > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > > > +
> > > > > +RPC
> > > > > +---
> > > > > +
> > > > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > > > +to an implementation detail.
> > > > > +
> > > > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > > > +methods are used instead of the less efficient variable sized methods.
> > > > > +
> > > > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > > > +has been submitted to IANA.org).
> > > > > +
> > > > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > > > +syntax:
> > > > > +
> > > > > +#define UUID_SIZE 16
> > > > > +typedef u8 uuid_t<UUID_SIZE>;
> > > > > +
> > > > > +program NFS_LOCALIO_PROGRAM {
> > > > > +     version NULLVERS {
> > > > > +        void NULL(void) = 0;
> > > > > +	} = 1;
> > > > > +     version GETUUIDVERS {
> > > > > +        uuid_t GETUUID(void) = 1;
> > > > > +	} = 1;
> > > > > +} = 0x20000002;
> > > > > +
> > > > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > > > +is used to implement GETUUID.
> > > > > +
> > > > > +Here are the respective XDR results for nfsd and nfs:
> > > > 
> > > > Hi Mike!
> > > > 
> > > > A protocol spec describes the on-the-wire data formats, not the
> > > > in-memory structure layouts. The below C structures are not
> > > > relevant to this specification. This should be all you need here,
> > > > if I understand your protocol correctly:
> > > > 
> > > > /* raw RFC 9562 UUID */
> > > > #define UUID_SIZE 16
> > > > typedef u8 uuid_t<UUID_SIZE>;
> > > > 
> > > > union GETUUID1res switch (uint32 status) {
> > > 
> > > I don't think we need a status in the protocol.  GETUUID always returns
> > > a UUID.  There is no possible error condition.
> > 
> > By having localio use NFS's XDR we're able to piggyback on a status
> > being returned by standard NFS RPC handling.
> > 
> > See:
> > nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
> > nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
> > FIXME comment about abusing nfs_opnum4).
> 
> No, let's not piggyback like that. Please make it a separate
> XDR implementation just like NFSACL is. Again, LOCALIO is not
> an extension of the NFS protocol. Making that claim confuses
> people for whom the term "extension" has a very precise meaning.
> If we were extending NFS, then yes, adding the new procedures
> to the NFS XDR implementation is appropriate, but that's not
> what you are doing: you are adding a new side-band protocol.

I reworded yesterday, when you seized on my saying "extension" in the
Documentation before and I changed it to "auxilliary" (I used
"extension" because that's how NFS ACL was framed, I had no idea it
was a loaded or nuanced term).  localio the protocol amounts to a
single RPC, no idea why I need to implement/duplicate a bunch of
boilerplate code to add a single GETUUID RPC. Localio as submitted
isn't concerned with some weird hijacking/extension of the NFS
protocol, it is enabling the NFS client and server to be more
efficient with how it fulfills IO.  Localio has no tie to NFS other
than it serves as connective glue between the NFS client and server
to optimize reads, writes, and commits.
 
> I have a long-term goal to ensure we can generate the source
> code of the XDR layer of all the kernel RPC protocols via an
> rpcgen like tool. A code generator can ensure that the
> marshalling and unmarshalling code is memory-safe.

The code doesn't show itself to be developed with such code generation
at all.  If anything it all feels very open-coded.

> By piggybacking, you are building LOCALIO into another
> protocol's XDR implementation, which makes it a special case,
> and thus harder to implement via code that is generated
> automatically from unmodified XDR language specs.

GETUUID really couldn't be simpler.  I'll do whatever you think best,
but I just want to be clear: when told to implement a side-band
protocol to coordinate I distilled it down to "I need a unique UUID
from the server" (hence GETUUID). When implementing it I set out to
work with what I saw in the existing code -- not reinvent boilerplate
code.  NFS ACL did that.  If I went further by (ab)using NFS xdr
helpers it was because I didn't want to write that code. If asked to
decouple, I'm just going to factor out cleaner code that LOCALIO and
NFS share to do what I already have it doing.

> Maybe the client side maintainers don't care about that, but
> please don't piggyback LOCALIO onto the NFSD's NFS XDR
> implementation.
> 
> Then, if it's a separate implementation, you can remove the status
> code. I was wondering why the server would reply with an error. If
> LOCALIO/GETUUID is not supported, then an RPC level error occurs
> anyway.

In my extensive discovery by trial and error I found that the NFS
status codes and infrastructure necessary. Otherwise basic/advanced
connectivity issues with sunrpc et al aren't accounted for properly.
NFS is pretty tightly coupled to sunrpc, and vice-versa.  Not
following why things have all of a sudden gotten "heightened", guess I
chose the wrong word ("piggybacking") when being very clear about what
the LOCALIO protocol does: piggyback on NFS XDR methods. I'm really
caught out for that being viewed as so bad (but my FIXME in
nfs4_xdr_dec_getuuidres really does need fixing).

> If you think you need an error like "Yes, I recognize that program
> and procedure, but this file system doesn't allow local access
> in any case" then that needs to be added to the protocol XDR
> specification.

The NFS status baked into NFS RPCs offer utility.

LOCALIO is not a standard, and we all agreed it doesn't need to be.
But it obviously cannot be fragile, etc.  SO I'm the new guy, happy to
"suffer".. however I would like to suffer less for being an open-book
about the code I've written and submitted ;)

> > Not sure how the reality of piggbacking on NFS XDR should be captured
> > in the protocol spec here.
> 
> It's an implementation choice, so it doesn't belong in the protocol
> spec that, again, lays out only on-the-wire behavior,.
> 
> Implementation specifics can be discussed in another section of the
> document.
> 
> 
> > > I don't think we need the NULL procedure either, but that is such a
> > > deeply entrenched practice I won't argue the point.
> > 
> > The code required it be there, I unfortunately don't recall specifics
> > on what failed if it wasn't there (may have been rpc_bind_new_program).
> 
> Please leave the NULL procedure where it is. Note that the NFS
> program's NULL procedure is used in two rather significant ways:
> 
>  - As a carrier for GSSAPI messages
>  - As a probe for the RPC-with-TLS capability
> 
> The latter might be significant if a client sends a LOCALIO
> operation as the first RPC procedure when contacting an unfamiliar
> server -- if it wants TLS protection for that procedure, then it
> will need to send a NULL(RPC_AUTH_TLS) as the very first RPC
> transaction.
> 
> Since LOCALIO/GETUUID is going over the network sometimes, it's
> reasonable to expect that these security measures could be used.

LOCALIO's GETUUID goes over the network all the time.  And it
literally needs to, and does, clone the NFS rpc client with:
rpc_bind_new_program(clp->cl_rpcclient

Mike
Mike Snitzer June 21, 2024, 12:30 a.m. UTC | #9
On Fri, Jun 21, 2024 at 09:42:26AM +1000, NeilBrown wrote:
> On Fri, 21 Jun 2024, Chuck Lever wrote:
> > On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
> > > On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> > > > On Thu, 20 Jun 2024, Chuck Lever wrote:
> > > > > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > > > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > > client and server to reliably handshake to determine if they are on the
> > > > > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > > the same connection as NFS traffic, follows the pattern established by
> > > > > > the NFS ACL protocol extension.
> > > > > > 
> > > > > > The robust handshake between local client and server is just the
> > > > > > beginning, the ultimate usecase this locality makes possible is the
> > > > > > client is able to issue reads, writes and commits directly to the server
> > > > > > without having to go over the network.  This is particularly useful for
> > > > > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > > job local to the server.
> > > > > > 
> > > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > > ---
> > > > > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > > > > >  include/linux/nfslocalio.h                |   2 +
> > > > > >  2 files changed, 150 insertions(+)
> > > > > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > > > > 
> > > > > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > > > > new file mode 100644
> > > > > > index 000000000000..a43c3dab2cab
> > > > > > --- /dev/null
> > > > > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > > > > @@ -0,0 +1,148 @@
> > > > > > +===========
> > > > > > +NFS localio
> > > > > > +===========
> > > > > > +
> > > > > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > > +client and server to reliably handshake to determine if they are on the
> > > > > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > > +the same connection as NFS traffic, follows the pattern established by
> > > > > > +the NFS ACL protocol extension.
> > > > > > +
> > > > > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > > > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > > > > +fragile sockaddr network address based match against all local network
> > > > > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > > > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > > > > +
> > > > > > +The robust handshake between local client and server is just the
> > > > > > +beginning, the ultimate usecase this locality makes possible is the
> > > > > > +client is able to issue reads, writes and commits directly to the server
> > > > > > +without having to go over the network.  This is particularly useful for
> > > > > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > > +job local to the server.
> > > > > > +
> > > > > > +The performance advantage realized from localio's ability to bypass
> > > > > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > > > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > > > > +-  With localio:
> > > > > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > > > > +-  Without localio:
> > > > > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > > > > +
> > > > > > +RPC
> > > > > > +---
> > > > > > +
> > > > > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > > > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > > > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > > > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > > > > +to an implementation detail.
> > > > > > +
> > > > > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > > > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > > > > +methods are used instead of the less efficient variable sized methods.
> > > > > > +
> > > > > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > > > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > > > > +has been submitted to IANA.org).
> > > > > > +
> > > > > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > > > > +syntax:
> > > > > > +
> > > > > > +#define UUID_SIZE 16
> > > > > > +typedef u8 uuid_t<UUID_SIZE>;
> > > > > > +
> > > > > > +program NFS_LOCALIO_PROGRAM {
> > > > > > +     version NULLVERS {
> > > > > > +        void NULL(void) = 0;
> > > > > > +	} = 1;
> > > > > > +     version GETUUIDVERS {
> > > > > > +        uuid_t GETUUID(void) = 1;
> > > > > > +	} = 1;
> > > > > > +} = 0x20000002;
> > > > > > +
> > > > > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > > > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > > > > +is used to implement GETUUID.
> > > > > > +
> > > > > > +Here are the respective XDR results for nfsd and nfs:
> > > > > 
> > > > > Hi Mike!
> > > > > 
> > > > > A protocol spec describes the on-the-wire data formats, not the
> > > > > in-memory structure layouts. The below C structures are not
> > > > > relevant to this specification. This should be all you need here,
> > > > > if I understand your protocol correctly:
> > > > > 
> > > > > /* raw RFC 9562 UUID */
> > > > > #define UUID_SIZE 16
> > > > > typedef u8 uuid_t<UUID_SIZE>;
> > > > > 
> > > > > union GETUUID1res switch (uint32 status) {
> > > > 
> > > > I don't think we need a status in the protocol.  GETUUID always returns
> > > > a UUID.  There is no possible error condition.
> > > 
> > > By having localio use NFS's XDR we're able to piggyback on a status
> > > being returned by standard NFS RPC handling.
> > > 
> > > See:
> > > nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
> > > nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
> > > FIXME comment about abusing nfs_opnum4).
> > 
> > No, let's not piggyback like that. Please make it a separate
> > XDR implementation just like NFSACL is. Again, LOCALIO is not
> > an extension of the NFS protocol. Making that claim confuses
> > people for whom the term "extension" has a very precise meaning.
> > If we were extending NFS, then yes, adding the new procedures
> > to the NFS XDR implementation is appropriate, but that's not
> > what you are doing: you are adding a new side-band protocol.
> 
> I'm currently working through the LOCALIO protocol code to make it a
> single version rather than '3' and '4'.  In the process I'm making it
> completely separate from the NFS protocol implementation and cleaning up
> some other bits.  e.g. it shouldn't register with rpcbind.
> 
> I'll hopefully post patches in a few hours.  I writing this now to
> discourage Mike from starting work on this.

Cool, thanks Neil!
Mike Snitzer June 21, 2024, 12:38 a.m. UTC | #10
On Thu, Jun 20, 2024 at 08:30:04PM -0400, Mike Snitzer wrote:
> On Fri, Jun 21, 2024 at 09:42:26AM +1000, NeilBrown wrote:
> > On Fri, 21 Jun 2024, Chuck Lever wrote:
> > > On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
> > > > On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
> > > > > On Thu, 20 Jun 2024, Chuck Lever wrote:
> > > > > > On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
> > > > > > > This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > > > added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > > > client and server to reliably handshake to determine if they are on the
> > > > > > > same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > > > the same connection as NFS traffic, follows the pattern established by
> > > > > > > the NFS ACL protocol extension.
> > > > > > > 
> > > > > > > The robust handshake between local client and server is just the
> > > > > > > beginning, the ultimate usecase this locality makes possible is the
> > > > > > > client is able to issue reads, writes and commits directly to the server
> > > > > > > without having to go over the network.  This is particularly useful for
> > > > > > > container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > > > job local to the server.
> > > > > > > 
> > > > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > > > > > ---
> > > > > > >  Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
> > > > > > >  include/linux/nfslocalio.h                |   2 +
> > > > > > >  2 files changed, 150 insertions(+)
> > > > > > >  create mode 100644 Documentation/filesystems/nfs/localio.rst
> > > > > > > 
> > > > > > > diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..a43c3dab2cab
> > > > > > > --- /dev/null
> > > > > > > +++ b/Documentation/filesystems/nfs/localio.rst
> > > > > > > @@ -0,0 +1,148 @@
> > > > > > > +===========
> > > > > > > +NFS localio
> > > > > > > +===========
> > > > > > > +
> > > > > > > +This document gives an overview of the LOCALIO auxiliary RPC protocol
> > > > > > > +added to the Linux NFS client and server (both v3 and v4) to allow a
> > > > > > > +client and server to reliably handshake to determine if they are on the
> > > > > > > +same host.  The LOCALIO auxiliary protocol's implementation, which uses
> > > > > > > +the same connection as NFS traffic, follows the pattern established by
> > > > > > > +the NFS ACL protocol extension.
> > > > > > > +
> > > > > > > +The LOCALIO auxiliary protocol is needed to allow robust discovery of
> > > > > > > +clients local to their servers.  Prior to this LOCALIO protocol a
> > > > > > > +fragile sockaddr network address based match against all local network
> > > > > > > +interfaces was attempted.  But unlike the LOCALIO protocol, the
> > > > > > > +sockaddr-based matching didn't handle use of iptables or containers.
> > > > > > > +
> > > > > > > +The robust handshake between local client and server is just the
> > > > > > > +beginning, the ultimate usecase this locality makes possible is the
> > > > > > > +client is able to issue reads, writes and commits directly to the server
> > > > > > > +without having to go over the network.  This is particularly useful for
> > > > > > > +container usecases (e.g. kubernetes) where it is possible to run an IO
> > > > > > > +job local to the server.
> > > > > > > +
> > > > > > > +The performance advantage realized from localio's ability to bypass
> > > > > > > +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
> > > > > > > +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
> > > > > > > +-  With localio:
> > > > > > > +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
> > > > > > > +-  Without localio:
> > > > > > > +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
> > > > > > > +
> > > > > > > +RPC
> > > > > > > +---
> > > > > > > +
> > > > > > > +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
> > > > > > > +method that allows the Linux nfs client to retrieve a Linux nfs server's
> > > > > > > +uuid.  This protocol isn't part of an IETF standard, nor does it need to
> > > > > > > +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
> > > > > > > +to an implementation detail.
> > > > > > > +
> > > > > > > +The GETUUID method encodes the server's uuid_t in terms of the fixed
> > > > > > > +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
> > > > > > > +methods are used instead of the less efficient variable sized methods.
> > > > > > > +
> > > > > > > +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
> > > > > > > +as 0x20000002 (but a request for a unique RPC program number assignment
> > > > > > > +has been submitted to IANA.org).
> > > > > > > +
> > > > > > > +The following approximately describes the LOCALIO in a pseudo rpcgen .x
> > > > > > > +syntax:
> > > > > > > +
> > > > > > > +#define UUID_SIZE 16
> > > > > > > +typedef u8 uuid_t<UUID_SIZE>;
> > > > > > > +
> > > > > > > +program NFS_LOCALIO_PROGRAM {
> > > > > > > +     version NULLVERS {
> > > > > > > +        void NULL(void) = 0;
> > > > > > > +	} = 1;
> > > > > > > +     version GETUUIDVERS {
> > > > > > > +        uuid_t GETUUID(void) = 1;
> > > > > > > +	} = 1;
> > > > > > > +} = 0x20000002;
> > > > > > > +
> > > > > > > +The above is the skeleton for the LOCALIO protocol, it doesn't account
> > > > > > > +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
> > > > > > > +is used to implement GETUUID.
> > > > > > > +
> > > > > > > +Here are the respective XDR results for nfsd and nfs:
> > > > > > 
> > > > > > Hi Mike!
> > > > > > 
> > > > > > A protocol spec describes the on-the-wire data formats, not the
> > > > > > in-memory structure layouts. The below C structures are not
> > > > > > relevant to this specification. This should be all you need here,
> > > > > > if I understand your protocol correctly:
> > > > > > 
> > > > > > /* raw RFC 9562 UUID */
> > > > > > #define UUID_SIZE 16
> > > > > > typedef u8 uuid_t<UUID_SIZE>;
> > > > > > 
> > > > > > union GETUUID1res switch (uint32 status) {
> > > > > 
> > > > > I don't think we need a status in the protocol.  GETUUID always returns
> > > > > a UUID.  There is no possible error condition.
> > > > 
> > > > By having localio use NFS's XDR we're able to piggyback on a status
> > > > being returned by standard NFS RPC handling.
> > > > 
> > > > See:
> > > > nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
> > > > nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
> > > > FIXME comment about abusing nfs_opnum4).
> > > 
> > > No, let's not piggyback like that. Please make it a separate
> > > XDR implementation just like NFSACL is. Again, LOCALIO is not
> > > an extension of the NFS protocol. Making that claim confuses
> > > people for whom the term "extension" has a very precise meaning.
> > > If we were extending NFS, then yes, adding the new procedures
> > > to the NFS XDR implementation is appropriate, but that's not
> > > what you are doing: you are adding a new side-band protocol.
> > 
> > I'm currently working through the LOCALIO protocol code to make it a
> > single version rather than '3' and '4'.  In the process I'm making it
> > completely separate from the NFS protocol implementation and cleaning up
> > some other bits.  e.g. it shouldn't register with rpcbind.
> > 
> > I'll hopefully post patches in a few hours.  I writing this now to
> > discourage Mike from starting work on this.
> 
> Cool, thanks Neil!

Oh, please base your changes on my latest nfs-localio-for-6.11 branch:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11
Chuck Lever June 21, 2024, 2:18 a.m. UTC | #11
> On Jun 20, 2024, at 8:28 PM, Mike Snitzer <snitzer@kernel.org> wrote:
> 
> On Thu, Jun 20, 2024 at 07:28:14PM -0400, Chuck Lever wrote:
>> On Thu, Jun 20, 2024 at 06:35:38PM -0400, Mike Snitzer wrote:
>>> On Fri, Jun 21, 2024 at 08:12:56AM +1000, NeilBrown wrote:
>>>> On Thu, 20 Jun 2024, Chuck Lever wrote:
>>>>> On Wed, Jun 19, 2024 at 04:40:31PM -0400, Mike Snitzer wrote:
>>>>>> This document gives an overview of the LOCALIO auxiliary RPC protocol
>>>>>> added to the Linux NFS client and server (both v3 and v4) to allow a
>>>>>> client and server to reliably handshake to determine if they are on the
>>>>>> same host.  The LOCALIO auxiliary protocol's implementation, which uses
>>>>>> the same connection as NFS traffic, follows the pattern established by
>>>>>> the NFS ACL protocol extension.
>>>>>> 
>>>>>> The robust handshake between local client and server is just the
>>>>>> beginning, the ultimate usecase this locality makes possible is the
>>>>>> client is able to issue reads, writes and commits directly to the server
>>>>>> without having to go over the network.  This is particularly useful for
>>>>>> container usecases (e.g. kubernetes) where it is possible to run an IO
>>>>>> job local to the server.
>>>>>> 
>>>>>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>>>>>> ---
>>>>>> Documentation/filesystems/nfs/localio.rst | 148 ++++++++++++++++++++++
>>>>>> include/linux/nfslocalio.h                |   2 +
>>>>>> 2 files changed, 150 insertions(+)
>>>>>> create mode 100644 Documentation/filesystems/nfs/localio.rst
>>>>>> 
>>>>>> diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
>>>>>> new file mode 100644
>>>>>> index 000000000000..a43c3dab2cab
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/filesystems/nfs/localio.rst
>>>>>> @@ -0,0 +1,148 @@
>>>>>> +===========
>>>>>> +NFS localio
>>>>>> +===========
>>>>>> +
>>>>>> +This document gives an overview of the LOCALIO auxiliary RPC protocol
>>>>>> +added to the Linux NFS client and server (both v3 and v4) to allow a
>>>>>> +client and server to reliably handshake to determine if they are on the
>>>>>> +same host.  The LOCALIO auxiliary protocol's implementation, which uses
>>>>>> +the same connection as NFS traffic, follows the pattern established by
>>>>>> +the NFS ACL protocol extension.
>>>>>> +
>>>>>> +The LOCALIO auxiliary protocol is needed to allow robust discovery of
>>>>>> +clients local to their servers.  Prior to this LOCALIO protocol a
>>>>>> +fragile sockaddr network address based match against all local network
>>>>>> +interfaces was attempted.  But unlike the LOCALIO protocol, the
>>>>>> +sockaddr-based matching didn't handle use of iptables or containers.
>>>>>> +
>>>>>> +The robust handshake between local client and server is just the
>>>>>> +beginning, the ultimate usecase this locality makes possible is the
>>>>>> +client is able to issue reads, writes and commits directly to the server
>>>>>> +without having to go over the network.  This is particularly useful for
>>>>>> +container usecases (e.g. kubernetes) where it is possible to run an IO
>>>>>> +job local to the server.
>>>>>> +
>>>>>> +The performance advantage realized from localio's ability to bypass
>>>>>> +using XDR and RPC for reads, writes and commits can be extreme, e.g.:
>>>>>> +fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
>>>>>> +-  With localio:
>>>>>> +  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
>>>>>> +-  Without localio:
>>>>>> +  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
>>>>>> +
>>>>>> +RPC
>>>>>> +---
>>>>>> +
>>>>>> +The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
>>>>>> +method that allows the Linux nfs client to retrieve a Linux nfs server's
>>>>>> +uuid.  This protocol isn't part of an IETF standard, nor does it need to
>>>>>> +be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
>>>>>> +to an implementation detail.
>>>>>> +
>>>>>> +The GETUUID method encodes the server's uuid_t in terms of the fixed
>>>>>> +UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
>>>>>> +methods are used instead of the less efficient variable sized methods.
>>>>>> +
>>>>>> +The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
>>>>>> +as 0x20000002 (but a request for a unique RPC program number assignment
>>>>>> +has been submitted to IANA.org).
>>>>>> +
>>>>>> +The following approximately describes the LOCALIO in a pseudo rpcgen .x
>>>>>> +syntax:
>>>>>> +
>>>>>> +#define UUID_SIZE 16
>>>>>> +typedef u8 uuid_t<UUID_SIZE>;
>>>>>> +
>>>>>> +program NFS_LOCALIO_PROGRAM {
>>>>>> +     version NULLVERS {
>>>>>> +        void NULL(void) = 0;
>>>>>> + } = 1;
>>>>>> +     version GETUUIDVERS {
>>>>>> +        uuid_t GETUUID(void) = 1;
>>>>>> + } = 1;
>>>>>> +} = 0x20000002;
>>>>>> +
>>>>>> +The above is the skeleton for the LOCALIO protocol, it doesn't account
>>>>>> +for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
>>>>>> +is used to implement GETUUID.
>>>>>> +
>>>>>> +Here are the respective XDR results for nfsd and nfs:
>>>>> 
>>>>> Hi Mike!
>>>>> 
>>>>> A protocol spec describes the on-the-wire data formats, not the
>>>>> in-memory structure layouts. The below C structures are not
>>>>> relevant to this specification. This should be all you need here,
>>>>> if I understand your protocol correctly:
>>>>> 
>>>>> /* raw RFC 9562 UUID */
>>>>> #define UUID_SIZE 16
>>>>> typedef u8 uuid_t<UUID_SIZE>;
>>>>> 
>>>>> union GETUUID1res switch (uint32 status) {
>>>> 
>>>> I don't think we need a status in the protocol.  GETUUID always returns
>>>> a UUID.  There is no possible error condition.
>>> 
>>> By having localio use NFS's XDR we're able to piggyback on a status
>>> being returned by standard NFS RPC handling.
>>> 
>>> See:
>>> nfs3svc_encode_getuuidres and nfs4svc_encode_getuuidres.
>>> nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres (and note the
>>> FIXME comment about abusing nfs_opnum4).
>> 
>> No, let's not piggyback like that. Please make it a separate
>> XDR implementation just like NFSACL is. Again, LOCALIO is not
>> an extension of the NFS protocol. Making that claim confuses
>> people for whom the term "extension" has a very precise meaning.
>> If we were extending NFS, then yes, adding the new procedures
>> to the NFS XDR implementation is appropriate, but that's not
>> what you are doing: you are adding a new side-band protocol.
> 
> I reworded yesterday, when you seized on my saying "extension" in the
> Documentation before and I changed it to "auxilliary" (I used
> "extension" because that's how NFS ACL was framed, I had no idea it
> was a loaded or nuanced term).  localio the protocol amounts to a
> single RPC, no idea why I need to implement/duplicate a bunch of
> boilerplate code to add a single GETUUID RPC. Localio as submitted
> isn't concerned with some weird hijacking/extension of the NFS
> protocol, it is enabling the NFS client and server to be more
> efficient with how it fulfills IO.  Localio has no tie to NFS other
> than it serves as connective glue between the NFS client and server
> to optimize reads, writes, and commits.

What I'm saying is please don't implement this as

   nfs3_xdr_dec_getuuidres and nfs4_xdr_dec_getuuidres

GETUUID is not an NFS procedure, as you say. Just
use:

   localio1_xdr_dec_getuuidres

Program name and version, XDR decoder, result data
type name. This is consistent with every other
modern (ie, recently-written) XDR function.


>> I have a long-term goal to ensure we can generate the source
>> code of the XDR layer of all the kernel RPC protocols via an
>> rpcgen like tool. A code generator can ensure that the
>> marshalling and unmarshalling code is memory-safe.
> 
> The code doesn't show itself to be developed with such code generation
> at all.  If anything it all feels very open-coded.

That's correct, it was open-coded in the 90's because
back then that was the way to build code that was
fast.

These days, compilers and CPUs optimize the shit out
of the code and instruction sequence so it matters
much less what we do in our source files. Since this
is a community-maintained effort, it makes sense to
spend our effort to use the source code as
documentation for the humans who have to read it
and update it, anywhere the utmost speed is not a
requirement.

In particular, right now there's no easy way to audit
our XDR implementation and prove/verify that we have
implemented exactly what the spec says. And, the open
coded nature of this implementation means it is
subject to memory bugs, which is bad for a widely
deployed network protocol.

In user space, RPC developers almost always start with
a .x file and use rpcgen to generate C (or Python)
code to handle marshaling of arguments and results.
That way, the XDR implementation starts with C code
that comes directly from the specification, untouched
by human hands.

We really need to take that approach in the kernel
too. The only reason not to do that is because there
is decades of technical debt baked into the current
open coded implementation. But I've spent a lot of
effort to move the server over to consistent and
properly layered XDR functions so that converting
it to use machine-generated XDR will be a burp
rather than a heart attack. Lots of work yet to do.


>> By piggybacking, you are building LOCALIO into another
>> protocol's XDR implementation, which makes it a special case,
>> and thus harder to implement via code that is generated
>> automatically from unmodified XDR language specs.
> 
> GETUUID really couldn't be simpler.  I'll do whatever you think best,
> but I just want to be clear: when told to implement a side-band
> protocol to coordinate I distilled it down to "I need a unique UUID
> from the server" (hence GETUUID). When implementing it I set out to
> work with what I saw in the existing code -- not reinvent boilerplate
> code.  NFS ACL did that.  If I went further by (ab)using NFS xdr
> helpers it was because I didn't want to write that code. If asked to
> decouple, I'm just going to factor out cleaner code that LOCALIO and
> NFS share to do what I already have it doing.

That "clean shared code" is the RPC XDR layer. There's
really no need to share any NFS-specific code between
NFS and LOCALIO. Please have a look at
include/linux/sunrpc/xdr.h for utility functions to
use for XDR; I suspect you will find everything you
need.

And I recommend having a look at

  https://www.rfc-editor.org/rfc/rfc4506

for more detail about what XDR is and why we use it.


>> Maybe the client side maintainers don't care about that, but
>> please don't piggyback LOCALIO onto the NFSD's NFS XDR
>> implementation.
>> 
>> Then, if it's a separate implementation, you can remove the status
>> code. I was wondering why the server would reply with an error. If
>> LOCALIO/GETUUID is not supported, then an RPC level error occurs
>> anyway.
> 
> In my extensive discovery by trial and error I found that the NFS
> status codes and infrastructure necessary. Otherwise basic/advanced
> connectivity issues with sunrpc et al aren't accounted for properly.

That is quite the opposite of my experience in
this code base. Can you share an example of the
connectivity issues you encountered?


> NFS is pretty tightly coupled to sunrpc, and vice-versa.  Not
> following why things have all of a sudden gotten "heightened", guess I
> chose the wrong word ("piggybacking") when being very clear about what
> the LOCALIO protocol does: piggyback on NFS XDR methods. I'm really
> caught out for that being viewed as so bad (but my FIXME in
> nfs4_xdr_dec_getuuidres really does need fixing).

Thanks for being clear about it, that is quite
helpful. But I'd rather this was implemented the way
every other RPC protocol is already done in the
kernel. There's more than one reason to do it that
way.


>> If you think you need an error like "Yes, I recognize that program
>> and procedure, but this file system doesn't allow local access
>> in any case" then that needs to be added to the protocol XDR
>> specification.
> 
> The NFS status baked into NFS RPCs offer utility.

I'm agnostic about whether GETUUID1res has a status
field or not... When Neil pointed it out, I was just
requesting that we have a little more rationale for
it than "the XDR layer seems to need this" because,
IME, it doesn't.

Note that RPC allows a little extensibility: if
you send an unrecognized procedure number, the
RPC server will bark at you. So we can add new
procedures without much difficulty; the client can
tell if the server doesn't support them.

And if the argument or result data items have to
change in existing procedures, then we create a
new version of the RPC program.

So there isn't a strong need to add fields we think
might be needed some day but don't have a use for
now, even in a private non-standard protocol.


--
Chuck Lever
diff mbox series

Patch

diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..a43c3dab2cab
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,148 @@ 
+===========
+NFS localio
+===========
+
+This document gives an overview of the LOCALIO auxiliary RPC protocol
+added to the Linux NFS client and server (both v3 and v4) to allow a
+client and server to reliably handshake to determine if they are on the
+same host.  The LOCALIO auxiliary protocol's implementation, which uses
+the same connection as NFS traffic, follows the pattern established by
+the NFS ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers.  Prior to this LOCALIO protocol a
+fragile sockaddr network address based match against all local network
+interfaces was attempted.  But unlike the LOCALIO protocol, the
+sockaddr-based matching didn't handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning, the ultimate usecase this locality makes possible is the
+client is able to issue reads, writes and commits directly to the server
+without having to go over the network.  This is particularly useful for
+container usecases (e.g. kubernetes) where it is possible to run an IO
+job local to the server.
+
+The performance advantage realized from localio's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+fio for 20 secs with 24 libaio threads, 64k directio reads, qd of 8,
+-  With localio:
+  read: IOPS=691k, BW=42.2GiB/s (45.3GB/s)(843GiB/20002msec)
+-  Without localio:
+  read: IOPS=15.7k, BW=984MiB/s (1032MB/s)(19.2GiB/20013msec)
+
+RPC
+---
+
+The LOCALIO auxiliary RPC protocol consists of a single "GETUUID" RPC
+method that allows the Linux nfs client to retrieve a Linux nfs server's
+uuid.  This protocol isn't part of an IETF standard, nor does it need to
+be considering it is Linux-to-Linux auxiliary RPC protocol that amounts
+to an implementation detail.
+
+The GETUUID method encodes the server's uuid_t in terms of the fixed
+UUID_SIZE (16 bytes).  The fixed size opaque encode and decode XDR
+methods are used instead of the less efficient variable sized methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is currently defined
+as 0x20000002 (but a request for a unique RPC program number assignment
+has been submitted to IANA.org).
+
+The following approximately describes the LOCALIO in a pseudo rpcgen .x
+syntax:
+
+#define UUID_SIZE 16
+typedef u8 uuid_t<UUID_SIZE>;
+
+program NFS_LOCALIO_PROGRAM {
+     version NULLVERS {
+        void NULL(void) = 0;
+	} = 1;
+     version GETUUIDVERS {
+        uuid_t GETUUID(void) = 1;
+	} = 1;
+} = 0x20000002;
+
+The above is the skeleton for the LOCALIO protocol, it doesn't account
+for NFS v3 and v4 RPC boilerplate (which also marshalls RPC status) that
+is used to implement GETUUID.
+
+Here are the respective XDR results for nfsd and nfs:
+
+From fs/nfsd/xdr.h
+
+struct nfsd_getuuidres {
+	__be32			status;
+	uuid_t			uuid;
+};
+
+From include/linux/nfs_xdr.h
+
+struct nfs_getuuidres {
+	__u8 *			uuid;
+};
+
+The client ultimately decodes this XDR using:
+xdr_stream_decode_opaque_fixed(xdr, result->uuid, UUID_SIZE);
+
+NFS Common and Server
+---------------------
+
+First use is in nfsd, to add access to a global nfsd_uuids list in
+nfs_common that is used to register and then identify local nfsd
+instances.
+
+nfsd_uuids is protected by the nfsd_mutex or RCU read lock and is
+composed of nfsd_uuid_t instances that are managed as nfsd creates them
+(per network namespace).
+
+nfsd_uuid_is_local() and nfsd_uuid_lookup() are used to search all local
+nfsd for the client specified nfsd uuid.
+
+The nfsd_uuids list is the basis for localio enablement, as such it has
+members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace, through it the client can
+access nn->nfsd_serv with proper rcu read access).  It is this client
+and server synchronization that enables advanced usage and lifetime of
+objects to span from the host kernel's nfsd to per-container knfsd
+instances that are connected to nfs client's running on the same local
+host.
+
+NFS Client
+----------
+
+fs/nfs/localio.c:nfs_local_probe() will retrieve a server's uuid via
+LOCALIO protocol and check if the server with that uuid is known to be
+local.  This ensures client and server 1: support localio 2: are local
+to each other.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of nfsd_uuid_t struct to allow a client local to a server to
+open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the nfsd network namespace and the associated nn->nfsd_serv in
+terms of RCU.  If nfsd_open_local_fh() finds that client no longer sees
+valid nfsd objects (be it struct net or nn->nfsd_serv) it return ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again.  This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a localio client is connected to it.
+
+Testing
+-------
+
+The LOCALIO auxiliary protocol and associated NFS localio read, right
+and commit access have proven stable against various test scenarios but
+these have not yet been formalized in any testsuite:
+
+-  Client and server both on localhost (for both v3 and v4.2).
+
+-  Various permutations of client and server support enablement for
+   both local and remote client and server.  Testing against NFS storage
+   products that don't support the LOCALIO protocol was also performed.
+
+-  Client on host, server within a container (for both v3 and v4.2)
+   The container testing was in terms of podman managed containers and
+   includes container stop/restart scenario.
diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
index c9592ad0afe2..a9722e18b527 100644
--- a/include/linux/nfslocalio.h
+++ b/include/linux/nfslocalio.h
@@ -20,6 +20,8 @@  extern struct list_head nfsd_uuids;
  * Each nfsd instance has an nfsd_uuid_t that is accessible through the
  * global nfsd_uuids list. Useful to allow a client to negotiate if localio
  * possible with its server.
+ *
+ * See Documentation/filesystems/nfs/localio.rst for more detail.
  */
 typedef struct {
 	uuid_t uuid;