nfs: reexport documentation

Message ID 20210921143259.GB21704@fieldses.org (mailing list archive)
State New, archived

Commit Message

J. Bruce Fields Sept. 21, 2021, 2:32 p.m. UTC
From: "J. Bruce Fields" <bfields@redhat.com>

We've supported reexport for a while but documentation is limited.  This
is mainly a simplified version of the text I wrote for the linux-nfs
wiki at https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 Documentation/filesystems/nfs/index.rst    |   1 +
 Documentation/filesystems/nfs/reexport.rst | 110 +++++++++++++++++++++
 2 files changed, 111 insertions(+)
 create mode 100644 Documentation/filesystems/nfs/reexport.rst

Comments

Chuck Lever Sept. 21, 2021, 2:39 p.m. UTC | #1
> On Sep 21, 2021, at 10:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> We've supported reexport for a while but documentation is limited.  This
> is mainly a simplified version of the text I wrote for the linux-nfs
> wiki at https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export.
> 
> Signed-off-by: J. Bruce Fields <bfields@redhat.com>

Thanks for posting this, Bruce! Comments inline.


> ---
> Documentation/filesystems/nfs/index.rst    |   1 +
> Documentation/filesystems/nfs/reexport.rst | 110 +++++++++++++++++++++
> 2 files changed, 111 insertions(+)
> create mode 100644 Documentation/filesystems/nfs/reexport.rst
> 
> diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
> index 65805624e39b..288d8ddb2bc6 100644
> --- a/Documentation/filesystems/nfs/index.rst
> +++ b/Documentation/filesystems/nfs/index.rst
> @@ -11,3 +11,4 @@ NFS
>    rpc-server-gss
>    nfs41-server
>    knfsd-stats
> +   reexport
> diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
> new file mode 100644
> index 000000000000..892cb1e9c45c
> --- /dev/null
> +++ b/Documentation/filesystems/nfs/reexport.rst
> @@ -0,0 +1,110 @@
> +Reexporting NFS filesystems
> +===========================
> +
> +Overview
> +--------
> +
> +It is possible to reexport an NFS filesystem over NFS.  However, this
> +feature comes with a number of limitations.  Before trying it, we
> +recommend some careful research to determine wether it will work for
> +your purposes.

^wether^whether


> +
> +A discussion of current known limitations follows.
> +
> +"fsid=" required, crossmnt broken
> +---------------------------------
> +
> +We require the "fsid=" export option on any reexport of an NFS
> +filesystem.

Recommended approach? I would just say use 'uuidgen -r'


> +The "crossmnt" export option does not work in the reexport case.

Can you expand on this a little? Consequences? Risks?


> +Reboot recovery
> +---------------
> +
> +The NFS protocol's normal reboot recovery mechanisms don't work for the
> +case when the reexport server reboots.  Clients will lose any locks
> +they held before the reboot, and further IO will result in errors.
> +Closing and reopening files should clear the errors.

Any recommended workarounds? Or does this simply mean that
administrators need to notify client users to unmount (or
at least stop their workloads) before rebooting the proxy?


> +Filehandle limits
> +-----------------
> +
> +If the original server uses an X byte filehandle for a given object, the
> +reexport server's filehandle for the reexported object will be X+22
> +bytes, rounded up to the nearest multiple of four bytes.
> +
> +The result must fit into the RFC-mandated filehandle size limits:
> +
> ++-------+-----------+
> +| NFSv2 |  32 bytes |
> ++-------+-----------+
> +| NFSv3 |  64 bytes |
> ++-------+-----------+
> +| NFSv4 | 128 bytes |
> ++-------+-----------+
> +
> +So, for example, you will only be able to reexport a filesystem over
> +NFSv2 if the original server gives you filehandles that fit in 10
> +bytes--which is unlikely.
> +
> +In general there's no way to know the maximum filehandle size given out
> +by an NFS server without asking the server vendor.
> +
> +But the following table gives a few examples.  The first column is the
> +typical length of the filehandle from a Linux server exporting the given
> +filesystem, the second is the length after that nfs export is reexported
> +by another Linux host:
> +
> ++--------+-------------------+----------------+
> +|        | filehandle length | after reexport |
> ++========+===================+================+
> +| ext4:  | 28 bytes          | 52 bytes       |
> ++--------+-------------------+----------------+
> +| xfs:   | 32 bytes          | 56 bytes       |
> ++--------+-------------------+----------------+
> +| btrfs: | 40 bytes          | 64 bytes       |
> ++--------+-------------------+----------------+
> +
> +All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
> +but none are reexportable over NFSv2.
> +
> +Linux server filehandles are a bit more complicated than this, though;
> +for example:
> +
> +        - The (non-default) "subtreecheck" export option generally
> +          requires another 4 to 8 bytes in the filehandle.
> +        - If you export a subdirectory of a filesystem (instead of
> +          exporting the filesystem root), that also usually adds 4 to 8
> +          bytes.
> +        - If you export over NFSv2, knfsd usually uses a shorter
> +          filesystem identifier that saves 8 bytes.
> +        - The root directory of an export uses a filehandle that is
> +          shorter.
> +
> +As you can see, the 128-byte NFSv4 filehandle is large enough that
> +you're unlikely to have trouble using NFSv4 to reexport any filesystem
> +exported from a Linux server.  In general, if the original server is
> +something that also supports NFSv3, you're *probably* OK.  Re-exporting
> +over NFSv3 may be dicier, and reexporting over NFSv2 will probably
> +never work.
> +
> +For more details of Linux filehandle structure, the best reference is
> +the source code and comments; see in particular:
> +
> +        - include/linux/exportfs.h:enum fid_type
> +        - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
> +        - fs/nfsd/nfsfh.c:set_version_and_fsid_type
> +        - fs/nfs/export.c:nfs_encode_fh
> +
> +Open DENY bits ignored
> +----------------------
> +
> +NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
> +allow you, for example, to open a file in a mode which forbids other
> +read opens or write opens. The Linux client doesn't use them, and the
> +server's support has always been incomplete: they are enforced only
> +against other NFS users, not against processes accessing the exported
> +filesystem locally. A reexport server will also not pass them along to
> +the original server, so they will not be enforced between clients of
> +different reexport servers.
> -- 
> 2.31.1
> 

--
Chuck Lever
J. Bruce Fields Sept. 21, 2021, 4 p.m. UTC | #2
On Tue, Sep 21, 2021 at 02:39:39PM +0000, Chuck Lever III wrote:
> > On Sep 21, 2021, at 10:32 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > +It is possible to reexport an NFS filesystem over NFS.  However, this
> > +feature comes with a number of limitations.  Before trying it, we
> > +recommend some careful research to determine wether it will work for
> > +your purposes.
> 
> ^wether^whether

Fixed.

> > +
> > +A discussion of current known limitations follows.
> > +
> > +"fsid=" required, crossmnt broken
> > +---------------------------------
> > +
> > +We require the "fsid=" export option on any reexport of an NFS
> > +filesystem.
> 
> Recommended approach? I would just say use 'uuidgen -r'

Looking at the manual.... I'd somehow missed that fsid= would take a
uuid (and not just a small integer) now.  So, sure, I'll add that as a
suggestion.

Longer term I wonder if it would work to do this automatically for new
nfs reexports.  The annoying part is you'd have to keep the fsid=
argument on disk somehow, either by modifying the export configuration
in /etc or by keeping them on the side somewhere.  That'd fix crossmnt
too.

> > +The "crossmnt" export option does not work in the reexport case.
> 
> Can you expand on this a little? Consequences? Risks?

crossmnt doesn't propagate fsid= (for obvious reasons), so if you cross
into another nfs filesystem then it'll fail.

Actually if you just had disk filesystems mounted underneath it'd
probably work.
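
In practice that means one export entry per NFS mount, each with its own
fsid.  A minimal Python sketch, assuming hypothetical mount points under
/srv/reexport and a wildcard client list; export_lines() is an
illustrative helper, not part of nfs-utils, and uuid.uuid4() stands in
for "uuidgen -r":

```python
import uuid


def export_lines(mountpoints, clients="*", options="rw,no_subtree_check"):
    """Build one exports(5)-style line per NFS mount, each with a unique fsid.

    The paths, client spec and option string are assumptions made for
    illustration; adjust them to the real configuration.
    """
    lines = []
    for path in mountpoints:
        fsid = uuid.uuid4()  # random (version 4) UUID, like `uuidgen -r`
        lines.append(f"{path} {clients}({options},fsid={fsid})")
    return lines


if __name__ == "__main__":
    # Hypothetical layout: a reexported NFS mount with a second NFS mount
    # nested underneath it.  crossmnt would not carry fsid= into the
    # nested mount, so it gets its own explicit entry.
    for line in export_lines(["/srv/reexport", "/srv/reexport/projects"]):
        print(line)
```

The output would be generated once and saved with the export
configuration rather than regenerated, since the fsid= argument has to
be kept on disk somehow for it to stay the same across restarts.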

> > +Reboot recovery
> > +---------------
> > +
> > +The NFS protocol's normal reboot recovery mechanisms don't work for the
> > +case when the reexport server reboots.  Clients will lose any locks
> > +they held before the reboot, and further IO will result in errors.
> > +Closing and reopening files should clear the errors.
> 
> Any recommended workarounds? Or does this simply mean that
> administrators need to notify client users to unmount (or
> at least stop their workloads) before rebooting the proxy?

I think so.

If you don't use any file locking or delegations I suppose you're also
OK.  Delegations might be useful, though.

I'd expect reexport to be useful mainly for data that changes very
rarely, if that helps.

--b.

diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
index 892cb1e9c45c..ff9ae4a46530 100644
--- a/Documentation/filesystems/nfs/reexport.rst
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -6,7 +6,7 @@ Overview
 
 It is possible to reexport an NFS filesystem over NFS.  However, this
 feature comes with a number of limitations.  Before trying it, we
-recommend some careful research to determine wether it will work for
+recommend some careful research to determine whether it will work for
 your purposes.
 
 A discussion of current known limitations follows.
@@ -15,9 +15,12 @@ A discussion of current known limitations follows.
 ---------------------------------
 
 We require the "fsid=" export option on any reexport of an NFS
-filesystem.
+filesystem.  You can use "uuidgen -r" to generate a unique argument.
 
-The "crossmnt" export option does not work in the reexport case.
+The "crossmnt" export option does not propagate "fsid=", so it will not allow
+traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid=" option.
 
 Reboot recovery
 ---------------
Daire Byrne Sept. 22, 2021, 9:47 a.m. UTC | #3
On Tue, 21 Sept 2021 at 17:00, Bruce Fields <bfields@fieldses.org> wrote:
>
> > Any recommended workarounds? Or does this simply mean that
> > administrators need to notify client users to unmount (or
> > at least stop their workloads) before rebooting the proxy?
>
> I think so.
>
> If you don't use any file locking or delegations I suppose you're also
> OK.  Delegations might be useful, though.
>
> I'd expect reexport to be useful mainly for data that changes very
> rarely, if that helps.
>
> --b.

Firstly, it's great to see this documentation and the well maintained
wiki page for something we use in production (a lot) - thanks Bruce!

I can only draw on our experience to say:
* if the nfs re-export server doesn't crash, we rarely have cause to reboot it.
* we re-export read-only software repositories to WAN/cloud instances
(an ideal use case).
* we also re-export read/write production storage but every client
process is writing unique files - there are no writes to the same
file(s) from any clients of the re-export server.

We don't use or need crossmnt functionality, but I know from chatting
to others within our industry that the fsid/crossmnt limitation causes
them the most grief and confusion. I think in the case of Netapps,
there are similar problems with trying to re-export a namespace made
up of different volumes?

As noted on the wiki, the only way around that is probably to have a
"proxy" mode (similar to what ganesha does?).

Daire
J. Bruce Fields Sept. 22, 2021, 2:26 p.m. UTC | #4
On Wed, Sep 22, 2021 at 10:47:35AM +0100, Daire Byrne wrote:
> We don't use or need crossmnt functionality, but I know from chatting
> to others within our industry that the fsid/crossmnt limitation causes
> them the most grief and confusion. I think in the case of Netapps,
> there are similar problems with trying to re-export a namespace made
> up of different volumes?
> 
> As noted on the wiki, the only way around that is probably to have a
> "proxy" mode (similar to what ganesha does?).

I'm not sure what Ganesha does.  I spent some time thinking about it and
couldn't figure out how to do it, at least not on my own in a reasonable
amount of time.

I liked the idea of limiting the proxy to reexport only one original
server and reusing the filehandles from the original server without any
wrapping.  That addresses the fsid/crossmnt limitation and filehandle
length limitations.  It means proxies all share the same filehandles so
eventually you could also implement migration and failover between them
and the original server.

It means when you get a filehandle the only way to find out *anything*
about it--even what filesystem it's from--is to go ask the server.
That's slow, so you need a filehandle cache.  You have to deal with the
case where you get a filehandle for an object that isn't mounted yet.
Its parents may not be mounted yet either.  If it's a regular file you
can't call LOOKUPP on it.  I'm not sure how to handle the vfs details in
that case--how do you fake up a superblock and vfsmount?

Simpler might be to give up on that idea of reusing the original
server's filehandle, and automatically generate and persistently store
uuids for new filesystems as you discover them.

I don't know, I'm rambling.

--b.
Frank Filz Sept. 22, 2021, 2:32 p.m. UTC | #5
> -----Original Message-----
> From: Daire Byrne [mailto:daire@dneg.com]
> Sent: Wednesday, September 22, 2021 2:48 AM
> To: Bruce Fields <bfields@fieldses.org>
> Cc: Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> Subject: Re: [PATCH] nfs: reexport documentation
> 
> On Tue, 21 Sept 2021 at 17:00, Bruce Fields <bfields@fieldses.org> wrote:
> >
> > > Any recommended workarounds? Or does this simply mean that
> > > administrators need to notify client users to unmount (or at least
> > > stop their workloads) before rebooting the proxy?
> >
> > I think so.
> >
> > If you don't use any file locking or delegations I suppose you're also
> > OK.  Delegations might be useful, though.
> >
> > I'd expect reexport to be useful mainly for data that changes very
> > rarely, if that helps.
> >
> > --b.
> 
> Firstly, it's great to see this documentation and the well maintained wiki page
> for something we use in production (a lot) - thanks Bruce!
> 
> I can only draw on our experience to say:
> * if the nfs re-export server doesn't crash, we rarely have cause to reboot it.
> * we re-export read-only software repositories to WAN/cloud instances (an
> ideal use case).
> * we also re-export read/write production storage but every client process is
> writing unique files - there are no writes to the same
> file(s) from any clients of the re-export server.
> 
> We don't use or need crossmnt functionality, but I know from chatting to others
> within our industry that the fsid/crossmnt limitation causes them the most grief
> and confusion. I think in the case of Netapps, there are similar problems with
> trying to re-export a namespace made up of different volumes?
> 
> As noted on the wiki, the only way around that is probably to have a "proxy"
> mode (similar to what ganesha does?).

I'm glad to see this documentation also. It helps to have a common understanding of the challenges of re-export.

Ganesha's proxy mode would be different from what Bruce is proposing due to how Ganesha constructs handles. It adds an export ID to the export configuration which leads to the specific back end (FSAL) that is exporting that handle. It may or may not also encode an fsid into the handle (proxy does so only to the extent the re-exported server put an fsid into the handle; Ganesha's proxy doesn't distinguish between the filesystems on the re-exported server). Ganesha can proxy and serve local file systems because its proxy isn't a re-export of an NFS mount, but a separate back end module (two actually, one for NFSv3 and one for NFSv4) that talks directly to the proxied server, while local file systems resolve handles pretty much the same way knfsd does (use the fsid to find the right filesystem and then use open_by_handle to find the inode within that filesystem - open_by_handle is a user space interface to the same exportfs interface that knfsd uses, with all the same limitations).

What Bruce is proposing is a bit more like a proxy mode I have considered for Ganesha that would allow both a proxied Ganesha server and the Ganesha proxy to use the same handles. Basically the export IDs would be shared between the two servers, though even that could be constructed in a way to allow multiple proxied servers (they just would be required to have non-overlapping export IDs) as well as still export local file systems.

Frank
Frank Filz Sept. 22, 2021, 2:40 p.m. UTC | #6
> On Wed, Sep 22, 2021 at 10:47:35AM +0100, Daire Byrne wrote:
> > We don't use or need crossmnt functionality, but I know from chatting
> > to others within our industry that the fsid/crossmnt limitation causes
> > them the most grief and confusion. I think in the case of Netapps,
> > there are similar problems with trying to re-export a namespace made
> > up of different volumes?
> >
> > As noted on the wiki, the only way around that is probably to have a
> > "proxy" mode (similar to what ganesha does?).
> 
> I'm not sure what Ganesha does.  I spent some time thinking about and
> couldn't figure out how to do it, at least not on my own in a reasonable
> amount of time.
> 
> I liked the idea of limiting the proxy to reexport only one original
> server and reusing the filehandles from the original server without any
> wrapping.  That addresses the fsid/crossmnt limitation and filehandle
> length limitations.  It means proxies all share the same filehandles so
> eventually you could also implement migration and failover between them
> and the original server.
> 
> It means when you get a filehandle the only way to find out *anything*
> about it--even what filesystem it's from--is to go ask the server.
> That's slow, so you need a filehandle cache.  You have to deal with the
> case where you get a filehandle for an object that isn't mounted yet.
> Its parents may not be mounted yet either.  If it's a regular file you
> can't call LOOKUPP on it.  I'm not sure how to handle the vfs details in
> that case--how do you fake up a superblock and vfsmount?

That really highlights how Ganesha's proxy is different than anything knfsd
would do. Ganesha doesn't have to deal with all the vfs stuff. It just needs
to be able to reflect the original server's handle back to the original
server. Ganesha ONLY supports NFSv3 proxy (without file open) and NFSv4.1+
proxy since it needs open by handle. Ganesha then lives with totally
floating inodes. They may get connected in a name space as a result of
READDIR or LOOKUP but it never needs the name space. Changing knfsd to NOT
need to hook the proxied server's handles into vfs would just result in a
kernel implementation of something more like Ganesha and probably not worth
it.

> Simpler might be to give up on that idea of reusing the original
> server's filehandle, and automatically generate and persistently store
> uuids for new filesystems as you discover them.

Ganesha does have a mode for proxy where it uses a database to map between
original server handles and Ganesha handles. That helps get around file
handle size problems but it's not a great solution because you have to keep
a persistent database or lose it all when the proxy crashes.

Frank
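
To illustrate the trade-off of a database-backed handle map, here is a toy
Python sketch. It is not Ganesha's implementation; the HandleMap class, the
table layout and the 16-byte truncated SHA-256 used as the short handle are
all assumptions made for illustration.

```python
import hashlib
import sqlite3


class HandleMap:
    """Toy persistent map between original-server handles and short handles."""

    def __init__(self, path):
        # `path` is a database file; ":memory:" gives a non-persistent map.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS handles ("
            "short BLOB PRIMARY KEY, original BLOB UNIQUE NOT NULL)"
        )
        self.db.commit()

    def to_short(self, original: bytes) -> bytes:
        # Fixed 16-byte handle regardless of the original handle's length.
        # Hash collisions are ignored in this sketch.
        short = hashlib.sha256(original).digest()[:16]
        self.db.execute(
            "INSERT OR IGNORE INTO handles VALUES (?, ?)", (short, original)
        )
        self.db.commit()
        return short

    def to_original(self, short: bytes):
        # Returns None for a short handle the database has never seen.
        row = self.db.execute(
            "SELECT original FROM handles WHERE short = ?", (short,)
        ).fetchone()
        return row[0] if row else None


if __name__ == "__main__":
    handles = HandleMap(":memory:")
    short = handles.to_short(b"\x01" * 100)  # pretend 100-byte original handle
    print(len(short), handles.to_original(short) == b"\x01" * 100)
```

The short handle fits any protocol limit, but every short handle a client
holds is only resolvable while the database survives, which is the
persistence cost described above.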

Patch

diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 65805624e39b..288d8ddb2bc6 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -11,3 +11,4 @@  NFS
    rpc-server-gss
    nfs41-server
    knfsd-stats
+   reexport
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000000..892cb1e9c45c
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,110 @@ 
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS.  However, this
+feature comes with a number of limitations.  Before trying it, we
+recommend some careful research to determine wether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem.
+
+The "crossmnt" export option does not work in the reexport case.
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots.  Clients will lose any locks
+they held before the reboot, and further IO will result in errors.
+Closing and reopening files should clear the errors.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 |  32 bytes |
++-------+-----------+
+| NFSv3 |  64 bytes |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples.  The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+|        | filehandle length | after reexport |
++========+===================+================+
+| ext4:  | 28 bytes          | 52 bytes       |
++--------+-------------------+----------------+
+| xfs:   | 32 bytes          | 56 bytes       |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes          | 64 bytes       |
++--------+-------------------+----------------+
+
+All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
+but none are reexportable over NFSv2.
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+        - The (non-default) "subtreecheck" export option generally
+          requires another 4 to 8 bytes in the filehandle.
+        - If you export a subdirectory of a filesystem (instead of
+          exporting the filesystem root), that also usually adds 4 to 8
+          bytes.
+        - If you export over NFSv2, knfsd usually uses a shorter
+          filesystem identifier that saves 8 bytes.
+        - The root directory of an export uses a filehandle that is
+          shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server.  In general, if the original server is
+something that also supports NFSv3, you're *probably* OK.  Re-exporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+        - include/linux/exportfs.h:enum fid_type
+        - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+        - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+        - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
+allow you, for example, to open a file in a mode which forbids other
+read opens or write opens. The Linux client doesn't use them, and the
+server's support has always been incomplete: they are enforced only
+against other NFS users, not against processes accessing the exported
+filesystem locally. A reexport server will also not pass them along to
+the original server, so they will not be enforced between clients of
+different reexport servers.
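
The size rule from the "Filehandle limits" section of the patch (X+22
bytes, rounded up to a multiple of four, checked against the per-version
limits) can be sketched in Python as follows. The function and constant
names are illustrative only; the numbers are the ones given in the patch.

```python
# Per-version filehandle size limits, as listed in the patch.
LIMITS = {"NFSv2": 32, "NFSv3": 64, "NFSv4": 128}

# Bytes added when a reexport server wraps the original filehandle.
WRAP_OVERHEAD = 22


def reexport_fh_len(original_len):
    """Length of the reexported filehandle for an X-byte original handle."""
    wrapped = original_len + WRAP_OVERHEAD
    return (wrapped + 3) // 4 * 4  # round up to a multiple of four bytes


def usable_versions(original_len):
    """NFS versions whose limit the reexported filehandle still fits."""
    length = reexport_fh_len(original_len)
    return [version for version, limit in LIMITS.items() if length <= limit]


if __name__ == "__main__":
    # Typical Linux server filehandle lengths from the table in the patch.
    for fs, length in [("ext4", 28), ("xfs", 32), ("btrfs", 40)]:
        print(fs, length, reexport_fh_len(length), usable_versions(length))
```

With these inputs the computed lengths match the "after reexport" column
(52, 56 and 64 bytes), and an original handle needs to be 10 bytes or
less before NFSv2 appears in the list.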