diff mbox

[3/8] mountd: remove 'dev_missing' checks

Message ID 20160714022643.5874.84409.stgit@noble (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown July 14, 2016, 2:26 a.m. UTC
I now think this was a mistaken idea.

If a filesystem is exported with the "mountpoint" or "mp" option, it
should only be exported if the directory is a mount point.  The
intention is that if there is a problem with one filesystem on a
server, the others can still be exported, but clients won't
incorrectly see the empty directory on the parent when accessing the
missing filesystem, they will see clearly that the filesystem is
missing.

It is easy to handle this correctly for NFSv3 MOUNT requests, but what
is the correct behavior if a client already has the filesystem mounted
and so has a filehandle?  Maybe the server rebooted and came back with
one device missing.  What should the client see?

The "dev_missing" code tries to detect this case and causes the server
to respond with silence rather than ESTALE.  The idea was that the
client would retry and when (or if) the filesystem came back, service
would be transparently restored.

The problem with this is that arbitrarily long delays are not what
people would expect, and can be quite annoying.  ESTALE, while
unpleasant, is at least easily understood.  A device disappearing is a
fairly significant event and hiding it doesn't really serve anyone.

So: remove the code and allow ESTALE.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 utils/mountd/cache.c |   12 ------------
 1 file changed, 12 deletions(-)




Comments

J. Bruce Fields July 18, 2016, 8:01 p.m. UTC | #1
On Thu, Jul 14, 2016 at 12:26:43PM +1000, NeilBrown wrote:
> I now think this was a mistaken idea.
> 
> If a filesystem is exported with the "mountpoint" or "mp" option, it
> should only be exported if the directory is a mount point.  The
> intention is that if there is a problem with one filesystem on a
> server, the others can still be exported, but clients won't
> incorrectly see the empty directory on the parent when accessing the
> missing filesystem, they will see clearly that the filesystem is
> missing.
> 
> It is easy to handle this correctly for NFSv3 MOUNT requests, but what
> is the correct behavior if a client already has the filesystem mounted
> and so has a filehandle?  Maybe the server rebooted and came back with
> one device missing.  What should the client see?
> 
> The "dev_missing" code tries to detect this case and causes the server
> to respond with silence rather than ESTALE.  The idea was that the
> client would retry and when (or if) the filesystem came back, service
> would be transparently restored.
> 
> The problem with this is that arbitrarily long delays are not what
> people would expect, and can be quite annoying.  ESTALE, while
> unpleasant, is at least easily understood.  A device disappearing is a
> fairly significant event and hiding it doesn't really serve anyone.

It could also be a filesystem disappearing because it failed to mount in
time on a reboot.

> So: remove the code and allow ESTALE.

I'm not completely sure I understand the justification.

I don't like the current behavior either--I'd be happy if we could
deprecate "mountpoint" entirely--but changing it now would seem to risk
regressions if anyone depends on it.

--b.

> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  utils/mountd/cache.c |   12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
> index ec86a22613cf..346a8b3af8b5 100644
> --- a/utils/mountd/cache.c
> +++ b/utils/mountd/cache.c
> @@ -687,7 +687,6 @@ static void nfsd_fh(int f)
>  	char *found_path = NULL;
>  	nfs_export *exp;
>  	int i;
> -	int dev_missing = 0;
>  	char buf[RPC_CHAN_BUF_SIZE], *bp;
>  	int blen;
>  
> @@ -754,11 +753,6 @@ static void nfsd_fh(int f)
>  			if (!is_ipaddr_client(dom)
>  					&& !namelist_client_matches(exp, dom))
>  				continue;
> -			if (exp->m_export.e_mountpoint &&
> -			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
> -					   exp->m_export.e_mountpoint:
> -					   exp->m_export.e_path))
> -				dev_missing ++;
>  
>  			if (!match_fsid(&parsed, exp, path))
>  				continue;
> @@ -801,12 +795,6 @@ static void nfsd_fh(int f)
>  		/* FIXME we need to make sure we re-visit this later */
>  		goto out;
>  	}
> -	if (!found && dev_missing) {
> -		/* The missing dev could be what we want, so just be
> -		 * quite rather than returning stale yet
> -		 */
> -		goto out;
> -	}
>  
>  	if (found)
>  		if (cache_export_ent(buf, sizeof(buf), dom, found, found_path) < 0)
> 
> 
NeilBrown July 19, 2016, 10:50 p.m. UTC | #2
On Tue, Jul 19 2016, J. Bruce Fields wrote:

> On Thu, Jul 14, 2016 at 12:26:43PM +1000, NeilBrown wrote:
>> I now think this was a mistaken idea.
>> 
>> If a filesystem is exported with the "mountpoint" or "mp" option, it
>> should only be exported if the directory is a mount point.  The
>> intention is that if there is a problem with one filesystem on a
>> server, the others can still be exported, but clients won't
>> incorrectly see the empty directory on the parent when accessing the
>> missing filesystem, they will see clearly that the filesystem is
>> missing.
>> 
>> It is easy to handle this correctly for NFSv3 MOUNT requests, but what
>> is the correct behavior if a client already has the filesystem mounted
>> and so has a filehandle?  Maybe the server rebooted and came back with
>> one device missing.  What should the client see?
>> 
>> The "dev_missing" code tries to detect this case and causes the server
>> to respond with silence rather than ESTALE.  The idea was that the
>> client would retry and when (or if) the filesystem came back, service
>> would be transparently restored.
>> 
>> The problem with this is that arbitrarily long delays are not what
>> people would expect, and can be quite annoying.  ESTALE, while
>> unpleasant, is at least easily understood.  A device disappearing is a
>> fairly significant event and hiding it doesn't really serve anyone.
>
> It could also be a filesystem disappearing because it failed to mount in
> time on a reboot.

I don't think "in time" is really an issue.  Boot sequencing should not
start nfsd until everything in /etc/fstab is mounted, or has failed and the
failure has been deemed acceptable.
That is why nfs-server.service has "After=local-fs.target".

>
>> So: remove the code and allow ESTALE.
>
> I'm not completely sure I understand the justification.

"hangs are bad".

When you cannot get a reply from the NFS server there are multiple
possible causes from temporary network glitch to server-is-dead.
You cannot reliably differentiate, so you have to just wait.

The server itself doesn't have the same uncertainty about its exported
filesystems.  They are either working or they aren't.
So it is possible, and I think reasonable, to send a more definitive
reply - ESTALE.

This particularly became an issue with NFSv4.
With NFSv3, mounting the filesystem is distinct from accessing it.
So it was easy to fail a mount request but delay an access request.
With NFSv4 we don't have that distinction.  If we make accesses wait,
then we make mount attempts wait too, which isn't at all friendly.

>
> I don't like the current behavior either--I'd be happy if we could
> deprecate "mountpoint" entirely--but changing it now would seem to risk
> regressions if anyone depends on it.

True.  There isn't really an easy solution there.

"mountpoint" seemed like a good idea when I wrote it.  But I never got
any proper peer review.

Thanks,
NeilBrown
J. Bruce Fields July 21, 2016, 5:24 p.m. UTC | #3
On Wed, Jul 20, 2016 at 08:50:12AM +1000, NeilBrown wrote:
> On Tue, Jul 19 2016, J. Bruce Fields wrote:
> 
> > On Thu, Jul 14, 2016 at 12:26:43PM +1000, NeilBrown wrote:
> >> I now think this was a mistaken idea.
> >> 
> >> If a filesystem is exported with the "mountpoint" or "mp" option, it
> >> should only be exported if the directory is a mount point.  The
> >> intention is that if there is a problem with one filesystem on a
> >> server, the others can still be exported, but clients won't
> >> incorrectly see the empty directory on the parent when accessing the
> >> missing filesystem, they will see clearly that the filesystem is
> >> missing.
> >> 
> >> It is easy to handle this correctly for NFSv3 MOUNT requests, but what
> >> is the correct behavior if a client already has the filesystem mounted
> >> and so has a filehandle?  Maybe the server rebooted and came back with
> >> one device missing.  What should the client see?
> >> 
> >> The "dev_missing" code tries to detect this case and causes the server
> >> to respond with silence rather than ESTALE.  The idea was that the
> >> client would retry and when (or if) the filesystem came back, service
> >> would be transparently restored.
> >> 
> >> The problem with this is that arbitrarily long delays are not what
> >> people would expect, and can be quite annoying.  ESTALE, while
> >> unpleasant, is at least easily understood.  A device disappearing is a
> >> fairly significant event and hiding it doesn't really serve anyone.
> >
> > It could also be a filesystem disappearing because it failed to mount in
> > time on a reboot.
> 
> I don't think "in time" is really an issue.  Boot sequencing should not
> start nfsd until everything in /etc/fstab is mounted, or has failed and the
> failure has been deemed acceptable.
> That is why nfs-server.service has "After=local-fs.target".

Yeah, I agree, that's the right way to do it.  I believe the old
behavior would be forgiving of misconfiguration here, though, which
means there's a chance somebody would witness a failure on upgrade.
Maybe the chance is small.

--b.
NeilBrown Aug. 11, 2016, 2:51 a.m. UTC | #4
On Fri, Jul 22 2016, J. Bruce Fields wrote:

> On Wed, Jul 20, 2016 at 08:50:12AM +1000, NeilBrown wrote:
>> On Tue, Jul 19 2016, J. Bruce Fields wrote:
>> 
>> > On Thu, Jul 14, 2016 at 12:26:43PM +1000, NeilBrown wrote:
>> >> I now think this was a mistaken idea.
>> >> 
>> >> If a filesystem is exported with the "mountpoint" or "mp" option, it
>> >> should only be exported if the directory is a mount point.  The
>> >> intention is that if there is a problem with one filesystem on a
>> >> server, the others can still be exported, but clients won't
>> >> incorrectly see the empty directory on the parent when accessing the
>> >> missing filesystem, they will see clearly that the filesystem is
>> >> missing.
>> >> 
>> >> It is easy to handle this correctly for NFSv3 MOUNT requests, but what
>> >> is the correct behavior if a client already has the filesystem mounted
>> >> and so has a filehandle?  Maybe the server rebooted and came back with
>> >> one device missing.  What should the client see?
>> >> 
>> >> The "dev_missing" code tries to detect this case and causes the server
>> >> to respond with silence rather than ESTALE.  The idea was that the
>> >> client would retry and when (or if) the filesystem came back, service
>> >> would be transparently restored.
>> >> 
>> >> The problem with this is that arbitrarily long delays are not what
>> >> people would expect, and can be quite annoying.  ESTALE, while
>> >> unpleasant, is at least easily understood.  A device disappearing is a
>> >> fairly significant event and hiding it doesn't really serve anyone.
>> >
>> > It could also be a filesystem disappearing because it failed to mount in
>> > time on a reboot.
>> 
>> I don't think "in time" is really an issue.  Boot sequencing should not
>> start nfsd until everything in /etc/fstab is mounted, or has failed and the
>> failure has been deemed acceptable.
>> That is why nfs-server.service has "After=local-fs.target".
>
> Yeah, I agree, that's the right way to do it.  [snip]

There is actually more here ... don't you love getting drip-feed
symptoms and requirements :-)

It turns out that the customer is NFS-exporting a filesystem mounted
using iSCSI.  Such filesystems are treated by systemd as "network"
filesystems, which seems at least a little bit reasonable.
So it is "remote-fs" that applies, or more accurately
"remote-fs-pre.target".
And nfs-server.service contains:

Before=remote-fs-pre.target

So nfsd is likely to start up before the iSCSI filesystems are mounted.

The customer tried to stop this by using a systemd drop-in to add
RequiresMountsFor= for the remote filesystem, but that causes
a loop with the Before=remote-fs-pre.target.
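
To make the loop concrete -- the path here is invented -- the drop-in
was something like:

    # /etc/systemd/system/nfs-server.service.d/iscsi.conf  (hypothetical)
    [Unit]
    RequiresMountsFor=/srv/iscsi-export

RequiresMountsFor= adds Requires= and After= on the mount unit, but a
network mount unit is itself After=remote-fs-pre.target, and
nfs-server.service is Before= that same target, so systemd reports an
ordering cycle.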

I don't think we need this line for sequencing start-up, but it does
seem to be useful for sequencing shutdown - so that nfs-server is
stopped after remote-fs-pre, which is stopped after things are
unmounted.
"useful", but not "right".  This doesn't stop remote servers from
shutting down in the wrong order.
We should probably remove this line and teach systemd to use "umount -f"
which doesn't block when the server is down.  If systemd just used a
script, that would be easy....

I'm not 100% certain that "umount -f" is correct.  We just want to stop
umount from stating the mountpoint, we don't need to send MNT_FORCE.
I sometimes think it would be really nice if NFS didn't block a
'getattr' request of the mountpoint.  That would remove some pain from
unmount and other places where the server was down, but probably would
cause other problems.

Does anyone have any opinions on the best way to make sure systemd
doesn't hang when it tries to umount a filesystem from an unresponsive
server?  Is "-f" best, or something else.



There is another issue related to this that I've been meaning to
mention.  It relates to the start-up ordering rather than shutdown.

When you try to mount an NFS filesystem and the server isn't responding,
mount.nfs retries for a little while and then - if "bg" is given - it
forks and retries a bit longer.
While it keeps getting failures that appear temporary, like ECONNREFUSED or
ETIMEDOUT (see nfs_is_permanent_error()), it keeps retrying.

There is typically a window between when rpcbind starts responding to
queries, and when nfsd has registered with it.  If mount.nfs sends an
rpcbind query in this window, it gets RPC_PROGNOTREGISTERED, which
nfs_rewrite_pmap_mount_options maps to EOPNOTSUPP, and
nfs_is_permanent_error() thinks that is a permanent error.

Strangely, when the 'bg' option is used, there is special-case code
Commit: bf66c9facb8e ("mounts.nfs: v2 and v3 background mounts should retry when server is down.")
to stop EOPNOTSUPP from being a permanent error.

Do people think it would be reasonable to make it a transient error for
foreground mounts too?
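
To be concrete, what I have in mind is roughly this (an illustrative
sketch only, not the actual nfs-utils source -- the real function lists
other transient errors as well):

    #include <errno.h>

    static int nfs_is_permanent_error(int error)
    {
            switch (error) {
            case ECONNREFUSED:
            case ETIMEDOUT:
            /* ... the other errors already treated as transient ... */
            case EOPNOTSUPP:  /* rpcbind answered, but nfsd not registered yet */
                    return 0; /* temporary - keep retrying */
            default:
                    return 1; /* permanent - give up */
            }
    }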

Thanks,
NeilBrown
J. Bruce Fields Aug. 16, 2016, 3:21 p.m. UTC | #5
On Thu, Aug 11, 2016 at 12:51:45PM +1000, NeilBrown wrote:
> On Fri, Jul 22 2016, J. Bruce Fields wrote:
> 
> > On Wed, Jul 20, 2016 at 08:50:12AM +1000, NeilBrown wrote:
> >> On Tue, Jul 19 2016, J. Bruce Fields wrote:
> >> 
> >> > On Thu, Jul 14, 2016 at 12:26:43PM +1000, NeilBrown wrote:
> >> >> I now think this was a mistaken idea.
> >> >> 
> >> >> If a filesystem is exported with the "mountpoint" or "mp" option, it
> >> >> should only be exported if the directory is a mount point.  The
> >> >> intention is that if there is a problem with one filesystem on a
> >> >> server, the others can still be exported, but clients won't
> >> >> incorrectly see the empty directory on the parent when accessing the
> >> >> missing filesystem, they will see clearly that the filesystem is
> >> >> missing.
> >> >> 
> >> >> It is easy to handle this correctly for NFSv3 MOUNT requests, but what
> >> >> is the correct behavior if a client already has the filesystem mounted
> >> >> and so has a filehandle?  Maybe the server rebooted and came back with
> >> >> one device missing.  What should the client see?
> >> >> 
> >> >> The "dev_missing" code tries to detect this case and causes the server
> >> >> to respond with silence rather than ESTALE.  The idea was that the
> >> >> client would retry and when (or if) the filesystem came back, service
> >> >> would be transparently restored.
> >> >> 
> >> >> The problem with this is that arbitrarily long delays are not what
> >> >> people would expect, and can be quite annoying.  ESTALE, while
> >> >> unpleasant, is at least easily understood.  A device disappearing is a
> >> >> fairly significant event and hiding it doesn't really serve anyone.
> >> >
> >> > It could also be a filesystem disappearing because it failed to mount in
> >> > time on a reboot.
> >> 
> >> I don't think "in time" is really an issue.  Boot sequencing should not
> >> start nfsd until everything in /etc/fstab is mounted, or has failed and the
> >> failure has been deemed acceptable.
> >> That is why nfs-server.service has "After=local-fs.target".
> >
> > Yeah, I agree, that's the right way to do it.  [snip]
> 
> There is actually more here ... don't you love getting drip-feed
> symptoms and requirements :-)

It's all good.

> It turns out that the customer is NFS-exporting a filesystem mounted
> using iSCSI.  Such filesystems are treated by systemd as "network"
> filesystems, which seems at least a little bit reasonable.
> So it is "remote-fs" that applies, or more accurately
> "remote-fs-pre.target".
> And nfs-server.service contains:
> 
> Before=remote-fs-pre.target

This is to handle the loopback case?

In which case what it really wants to say is "before nfs mounts" (or
even "before nfs mounts of localhost"; and vice versa on shutdown).  I
can't tell if there's an easy way to say that.

> So nfsd is likely to start up before the iSCSI filesystems are mounted.
> 
> The customer tried to stop this by using a systemd drop-in to add
> RequiresMountsFor= for the remote filesystem, but that causes
> a loop with the Before=remote-fs-pre.target.
> 
> I don't think we need this line for sequencing start-up, but it does
> seem to be useful for sequencing shutdown - so that nfs-server is
> stopped after remote-fs-pre, which is stopped after things are
> unmounted.
> "useful", but not "right".  This doesn't stop remote servers from
> shutting down in the wrong order.
> We should probably remove this line and teach systemd to use "umount -f"
> which doesn't block when the server is down.  If systemd just used a
> script, that would be easy....
> 
> I'm not 100% certain that "umount -f" is correct.  We just want to stop
> umount from stating the mountpoint, we don't need to send MNT_FORCE.
> I sometimes think it would be really nice if NFS didn't block a
> 'getattr' request of the mountpoint.  That would remove some pain from
> unmount and other places where the server was down, but probably would
> cause other problems.

I thought I remembered Jeff Layton doing some work to remove the need to
revalidate the mountpoint on unmount in recent kernels.

Is that the only risk, though?  Maybe so--presumably you've killed any
users, so any write data associated with opens should be flushed.  And
if you do a sync after that you take care of write delegations too.

> Does anyone have any opinions on the best way to make sure systemd
> doesn't hang when it tries to umount a filesystem from an unresponsive
> server?  Is "-f" best, or something else.
> 
> 
> 
> There is another issue related to this that I've been meaning to
> mention.  It relates to the start-up ordering rather than shutdown.
> 
> When you try to mount an NFS filesystem and the server isn't responding,
> mount.nfs retries for a little while and then - if "bg" is given - it
> forks and retries a bit longer.
> While it keeps getting failures that appear temporary, like ECONNREFUSED or
> ETIMEDOUT (see nfs_is_permanent_error()), it keeps retrying.
> 
> There is typically a window between when rpcbind starts responding to
> queries, and when nfsd has registered with it.  If mount.nfs sends an
> rpcbind query in this window, it gets RPC_PROGNOTREGISTERED, which
> nfs_rewrite_pmap_mount_options maps to EOPNOTSUPP, and
> nfs_is_permanent_error() thinks that is a permanent error.

Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
registrations before it starts responding to requests?

--b.

> 
> Strangely, when the 'bg' option is used, there is special-case code
> Commit: bf66c9facb8e ("mounts.nfs: v2 and v3 background mounts should retry when server is down.")
> to stop EOPNOTSUPP from being a permanent error.
> 
> Do people think it would be reasonable to make it a transient error for
> foreground mounts too?
> 
> Thanks,
> NeilBrown


NeilBrown Aug. 18, 2016, 1:32 a.m. UTC | #6
On Wed, Aug 17 2016, J. Bruce Fields wrote:
>
>> It turns out that the customer is NFS-exporting a filesystem mounted
>> using iSCSI.  Such filesystems are treated by systemd as "network"
>> filesystems, which seems at least a little bit reasonable.
>> So it is "remote-fs" that applies, or more accurately
>> "remote-fs-pre.target".
>> And nfs-server.service contains:
>> 
>> Before=remote-fs-pre.target
>
> This is to handle the loopback case?

That is what the git-commit says.  Specifically to handle the
shutdown/unmount side.

>
> In which case what it really wants to say is "before nfs mounts" (or
> even "before nfs mounts of localhost"; and vice versa on shutdown).  I
> can't tell if there's an easy way to say that.

I'd be happy with a difficult/complex way, if it was reliable.
Could we write a systemd generator which parses /etc/fstab, determines
all mount points which are loop-back NFS mounts (or even just any NFS
mounts) and creates a drop-in for nfs-server which adds
  Before=mount-point.mount
for each /mount/point.

Could that be reliable?  I might try.
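
For a loop-back NFS mount on, say, /mnt/data, the generated drop-in
would be something like (names invented):

    # nfs-server.service.d/loopback-nfs-mounts.conf, written by the generator
    [Unit]
    Before=mnt-data.mount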


Another generator could process /etc/exports and add
   RequiresMountsFor=/export/point
for every export point.  Maybe skip export points with the "mountpoint"
option as mountd expects those to possibly appear later.
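
Again just to illustrate, that one would emit a drop-in along the lines
of:

    # nfs-server.service.d/exports.conf, written by the generator
    [Unit]
    RequiresMountsFor=/export/point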
   
>
>> So nfsd is likely to start up before the iSCSI filesystems are mounted.
>> 
>> The customer tried to stop this by using a systemd drop-in to add
>> RequiresMountsFor= for the remote filesystem, but that causes
>> a loop with the Before=remote-fs-pre.target.
>> 
>> I don't think we need this line for sequencing start-up, but it does
>> seem to be useful for sequencing shutdown - so that nfs-server is
>> stopped after remote-fs-pre, which is stopped after things are
>> unmounted.
>> "useful", but not "right".  This doesn't stop remote servers from
>> shutting down in the wrong order.
>> We should probably remove this line and teach systemd to use "umount -f"
>> which doesn't block when the server is down.  If systemd just used a
>> script, that would be easy....
>> 
>> I'm not 100% certain that "umount -f" is correct.  We just want to stop
>> umount from stating the mountpoint, we don't need to send MNT_FORCE.
>> I sometimes think it would be really nice if NFS didn't block a
>> 'getattr' request of the mountpoint.  That would remove some pain from
>> unmount and other places where the server was down, but probably would
>> cause other problems.
>
> I thought I remembered Jeff Layton doing some work to remove the need to
> revalidate the mountpoint on unmount in recent kernels.

Jeff's work means that the umount system call won't revalidate the mount
point.  This is important and useful, but not sufficient.
/usr/bin/mount will 'lstat' the mountpoint (unless -f or -l is given).
That is what causes the problem.
mount mostly wants to make sure it has a canonical name.  It should be
possible to get it to avoid the stat if the name can be determined to
be canonical some other way.  Just looking to see if it is in
/proc/mounts would work, but there are comments in the code about
/proc/mounts sometimes being very large, and not wanting to do that too
much.... need to look again.

>
> Is that the only risk, though?  Maybe so--presumably you've killed any
> users, so any write data associated with opens should be flushed.  And
> if you do a sync after that you take care of write delegations too.

In the easily reproducible case, all user processes are gone.
It would be worth checking what happens if processes are accessing a
filesystem from an unreachable server at shutdown.
"kill -9" should get rid of them all now, so it might be OK.
"sync" would hang though.  I'd be happy for that to cause a delay of a
minute or so, but hopefully systemd would (or could be told to) kill -9
a sync if it took too long.

>
>> Does anyone have any opinions on the best way to make sure systemd
>> doesn't hang when it tries to umount a filesystem from an unresponsive
>> server?  Is "-f" best, or something else.
>> 
>> 
>> 
>> There is another issue related to this that I've been meaning to
>> mention.  It relates to the start-up ordering rather than shutdown.
>> 
>> When you try to mount an NFS filesystem and the server isn't responding,
>> mount.nfs retries for a little while and then - if "bg" is given - it
>> forks and retries a bit longer.
>> While it keeps getting failures that appear temporary, like ECONNREFUSED or
>> ETIMEDOUT (see nfs_is_permanent_error()), it keeps retrying.
>> 
>> There is typically a window between when rpcbind starts responding to
>> queries, and when nfsd has registered with it.  If mount.nfs sends an
>> rpcbind query in this window, it gets RPC_PROGNOTREGISTERED, which
>> nfs_rewrite_pmap_mount_options maps to EOPNOTSUPP, and
>> nfs_is_permanent_error() thinks that is a permanent error.
>
> Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
> registrations before it starts responding to requests?

"-w" (which isn't listed in the SYNOPSIS!) only applies to a warm-start
where the daemons which previously registered are still running.
The problem case is that the daemons haven't registered yet (so we don't
necessarily know what port number they will get).

To address the issue in rpcbind, we would need a flag to say "don't
respond to lookup requests, just accept registrations", then when all
registrations are complete, send some message to rpcbind to say "OK,
respond to lookups now".  That could even be done by killing and
restarting with "-w", though that is a bit ugly.

I'm leaning towards having mount retry after RPC_PROGNOTREGISTERED for
fg like it does with bg.

Thanks,
NeilBrown


>
> --b.
>
>> 
>> Strangely, when the 'bg' option is used, there is special-case code
>> Commit: bf66c9facb8e ("mounts.nfs: v2 and v3 background mounts should retry when server is down.")
>> to stop EOPNOTSUPP from being a permanent error.
>> 
>> Do people think it would be reasonable to make it a transient error for
>> foreground mounts too?
>> 
>> Thanks,
>> NeilBrown
Chuck Lever Aug. 18, 2016, 2:57 a.m. UTC | #7
> On Aug 17, 2016, at 9:32 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Wed, Aug 17 2016, J. Bruce Fields wrote:
>>> 
>>> 
>>> There is another issue related to this that I've been meaning to
>>> mention.  It relates to the start-up ordering rather than shutdown.
>>> 
>>> When you try to mount an NFS filesystem and the server isn't responding,
>>> mount.nfs retries for a little while and then - if "bg" is given - it
>>> forks and retries a bit longer.
>>> While it keeps getting failures that appear temporary, like ECONNREFUSED or
>>> ETIMEDOUT (see nfs_is_permanent_error()), it keeps retrying.
>>> 
>>> There is typically a window between when rpcbind starts responding to
>>> queries, and when nfsd has registered with it.  If mount.nfs sends an
>>> rpcbind query in this window, it gets RPC_PROGNOTREGISTERED, which
>>> nfs_rewrite_pmap_mount_options maps to EOPNOTSUPP, and
>>> nfs_is_permanent_error() thinks that is a permanent error.
>> 
>> Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
>> registrations before it starts responding to requests?
> 
> "-w" (which isn't listed in the SYNOPSIS!) only applies to a warm-start
> where the daemons which previously registered are still running.
> The problem case is that the daemons haven't registered yet (so we don't
> necessarily know what port number they will get).
> 
> To address the issue in rpcbind, we would need a flag to say "don't
> respond to lookup requests, just accept registrations", then when all
> registrations are complete, send some message to rpcbind to say "OK,
> respond to lookups now".  That could even be done by killing and
> restarting with "-w", though that is a bit ugly.

An alternative would be to create a temporary firewall rule that
blocked port 111 to remote connections. Once local RPC services
had registered, the rule is removed.
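
Something like this, perhaps (untested, and the matching would need
more care than this):

    # keep remote clients away from rpcbind while local services register
    iptables -I INPUT -p tcp --dport 111 ! -s 127.0.0.0/8 -j REJECT
    iptables -I INPUT -p udp --dport 111 ! -s 127.0.0.0/8 -j REJECT
    # ... start rpcbind, wait for nfsd and friends to register ...
    iptables -D INPUT -p tcp --dport 111 ! -s 127.0.0.0/8 -j REJECT
    iptables -D INPUT -p udp --dport 111 ! -s 127.0.0.0/8 -j REJECT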

Just a thought.


> I'm leaning towards having mount retry after RPC_PROGNOTREGISTERED for
> fg like it does with bg.

It probably should do that. If rpcbind is up, then the other
services are probably on their way.


--
Chuck Lever



J. Bruce Fields Aug. 18, 2016, 1:57 p.m. UTC | #8
Not really arguing--I'll trust your judgement--just some random ideas:

On Thu, Aug 18, 2016 at 11:32:52AM +1000, NeilBrown wrote:
> On Wed, Aug 17 2016, J. Bruce Fields wrote:
> > In which case what it really wants to say is "before nfs mounts" (or
> > even "before nfs mounts of localhost"; and vice versa on shutdown).  I
> > can't tell if there's an easy way to say that.
> 
> I'd be happy with a difficult/complex way, if it was reliable.
> Could we write a systemd generator which parses /etc/fstab, determines
> all mount points which are loop-back NFS mounts (or even just any NFS
> mounts) and creates a drop-in for nfs-server which adds
>   Before=mount-point.mount
> for each /mount/point.
> 
> Could that be reliable?  I might try.

Digging around... we've also got this callout from mount to start-statd,
can we use something like that to make loopback nfs mounts wait on nfs
server startup?

> > Is that the only risk, though?  Maybe so--presumably you've killed any
> > users, so any write data associated with opens should be flushed.  And
> > if you do a sync after that you take care of write delegations too.
> 
> In the easily reproducible case, all user processes are gone.
> It would be worth checking what happens if processes are accessing a
> filesystem from an unreachable server at shutdown.
> "kill -9" should get rid of them all now, so it might be OK.
> "sync" would hang though.  I'd be happy for that to cause a delay of a
> minute or so, but hopefully systemd would (or could be told to) kill -9
> a sync if it took too long.

We shouldn't have to resort to that in the loopback nfs case, where we
control ordering.  So in that case, I'm just pointing out that:

	kill -9 all users of the filesystem
	shutdown nfs server
	umount nfs filesystems

isn't the right ordering, because in the presence of write delegations
there could still be writeback data.

(OK, actually, knfsd doesn't currently implement write delegations--but
we shouldn't depend on that assumption.)

Adding a sync between the first two steps might help, though the write
delegations themselves could still linger, and I don't know how the
client will behave when it finds it can't return them.

So it'd be nice if we could just order the umount before the server
shutdown.

The case of a remote server shut down too early is different of course.

> > Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
> > registrations before it starts responding to requests?
> 
> "-w" (which isn't listed in the SYNOPSIS!) only applies to a warm-start
> where the daemons which previously registered are still running.
> The problem case is that the daemons haven't registered yet (so we don't
> necessarily know what port number they will get).

We probably know the port in the specific case of nfsd, and could fake
up rpcbind's state file if necessary.  Eh, your idea's not as bad:

> To address the issue in rpcbind, we would need a flag to say "don't
> respond to lookup requests, just accept registrations", then when all
> registrations are complete, send some message to rpcbind to say "OK,
> respond to lookups now".  That could even be done by killing and
> restarting with "-w", though that is a bit ugly.
> 
> I'm leaning towards having mount retry after RPC_PROGNOTREGISTERED for
> fg like it does with bg.

Anyway, sounds OK to me.

--b.
NeilBrown Aug. 19, 2016, 1:28 a.m. UTC | #9
On Thu, Aug 18 2016, J. Bruce Fields wrote:

> Not really arguing--I'll trust your judgement--just some random ideas:
>
> On Thu, Aug 18, 2016 at 11:32:52AM +1000, NeilBrown wrote:
>> On Wed, Aug 17 2016, J. Bruce Fields wrote:
>> > In which case what it really wants to say is "before nfs mounts" (or
>> > even "before nfs mounts of localhost"; and vice versa on shutdown).  I
> >> > can't tell if there's an easy way to say that.
>> 
>> I'd be happy with a difficult/complex way, if it was reliable.
>> Could we write a systemd generator which parses /etc/fstab, determines
> >> all mount points which are loop-back NFS mounts (or even just any NFS
>> mounts) and creates a drop-in for nfs-server which adds
>>   Before=mount-point.mount
>> for each /mount/point.
>> 
>> Could that be reliable?  I might try.
>
> Digging around... we've also got this callout from mount to start-statd,
> can we use something like that to make loopback nfs mounts wait on nfs
> server startup?

An nfs mount already waits for the server to start up.  The ordering
dependency between NFS mounts and the nfs-server only really matters at
shutdown, and we cannot enhance mount.nfs to wait for a negative amount
of time (also known as "time travel")

>
>> > Is that the only risk, though?  Maybe so--presumably you've killed any
>> > users, so any write data associated with opens should be flushed.  And
>> > if you do a sync after that you take care of write delegations too.
>> 
>> In the easily reproducible case, all user processes are gone.
>> It would be worth checking what happens if processes are accessing a
>> filesystem from an unreachable server at shutdown.
>> "kill -9" should get rid of them all now, so it might be OK.
>> "sync" would hang though.  I'd be happy for that to cause a delay of a
>> minute or so, but hopefully systemd would (or could be told to) kill -9
>> a sync if it took too long.
>
> We shouldn't have to resort to that in the loopback nfs case, where we
> control ordering.  So in that case, I'm just pointing out that:
>
> 	kill -9 all users of the filesystem
> 	shutdown nfs server
> 	umount nfs filesystems
>
> isn't the right ordering, because in the presence of write delegations
> there could still be writeback data.

Yes, that does make a good case for getting the ordering right, rather
than just getting the shutdown-sequence not to block.  Thanks,

>
> (OK, actually, knfsd doesn't currently implement write delegations--but
> we shouldn't depend on that assumption.)
>
> Adding a sync between the first two steps might help, though the write
> delegations themselves could still linger, and I don't know how the
> client will behave when it finds it can't return them.
>
> So it'd be nice if we could just order the umount before the server
> shutdown.
>
> The case of a remote server shut down too early is different of course.
>
>> > Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
>> > registrations before it starts responding to requests?
>> 
>> "-w" (which isn't listed in the SYNOPSIS!) only applies to a warm-start
>> where the daemons which previously registered are still running.
>> The problem case is that the daemons haven't registered yet (so we don't
>> necessarily know what port number they will get).
>
> We probably know the port in the specific case of nfsd, and could fake
> up rpcbind's state file if necessary.  Eh, your idea's not as bad:
>
>> To address the issue in rpcbind, we would need a flag to say "don't
>> respond to lookup requests, just accept registrations", then when all
>> registrations are complete, send some message to rpcbind to say "OK,
>> respond to lookups now".  That could even be done by killing and
> >> restarting with "-w", though that is a bit ugly.
>> 
>> I'm leaning towards having mount retry after RPC_PROGNOTREGISTERED for
>> fg like it does with bg.
>
> Anyway, sounds OK to me.

Thanks,
NeilBrown
NeilBrown Aug. 19, 2016, 1:31 a.m. UTC | #10
On Thu, Aug 18 2016, Chuck Lever wrote:

>> On Aug 17, 2016, at 9:32 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Wed, Aug 17 2016, J. Bruce Fields wrote:
>>>> 
>>>> 
>>>> There is another issue related to this that I've been meaning to
>>>> mention.  It relates to the start-up ordering rather than shutdown.
>>>> 
>>>> When you try to mount an NFS filesystem and the server isn't responding,
>>>> mount.nfs retries for a little while and then - if "bg" is given - it
>>>> forks and retries a bit longer.
>>>> While it keeps getting failures that appear temporary, like ECONNREFUSED or
>>>> ETIMEDOUT (see nfs_is_permanent_error()), it keeps retrying.
>>>> 
>>>> There is typically a window between when rpcbind starts responding to
>>>> queries, and when nfsd has registered with it.  If mount.nfs sends an
>>>> rpcbind query in this window, it gets RPC_PROGNOTREGISTERED, which
>>>> nfs_rewrite_pmap_mount_options maps to EOPNOTSUPP, and
>>>> nfs_is_permanent_error() thinks that is a permanent error.
>>> 
>>> Looking at rpcbind(8)....  Shouldn't "-w" prevent this by loading some
>>> registrations before it starts responding to requests?
>> 
>> "-w" (which isn't listed in the SYNOPSIS!) only applies to a warm-start
>> where the daemons which previously registered are still running.
>> The problem case is that the daemons haven't registered yet (so we don't
>> necessarily know what port number they will get).
>> 
>> To address the issue in rpcbind, we would need a flag to say "don't
>> respond to lookup requests, just accept registrations", then when all
>> registrations are complete, send some message to rpcbind to say "OK,
>> respond to lookups now".  That could even be done by killing and
>> restarting with "-w", though that is a bit ugly.
>
> An alternative would be to create a temporary firewall rule that
> blocked port 111 to remote connections. Once local RPC services
> had registered, the rule is removed.
>
> Just a thought.

Interesting ..... probably the sort of thing that I would resort to if I
really needed to fix this and didn't have any source code.
But fiddling with firewall rules is not one of my favourite things, so I
think I'll stick with a more focussed solution.
Thanks!

NeilBrown


>
>
>> I'm leaning towards having mount retry after RPC_PROGNOTREGISTERED for
>> fg like it does with bg.
>
> It probably should do that. If rpcbind is up, then the other
> services are probably on their way.
>
>
> --
> Chuck Lever
J. Bruce Fields Aug. 19, 2016, 5:27 p.m. UTC | #11
On Fri, Aug 19, 2016 at 11:28:30AM +1000, NeilBrown wrote:
> On Thu, Aug 18 2016, J. Bruce Fields wrote:
> 
> > Not really arguing--I'll trust your judgement--just some random ideas:
> >
> > On Thu, Aug 18, 2016 at 11:32:52AM +1000, NeilBrown wrote:
> >> On Wed, Aug 17 2016, J. Bruce Fields wrote:
> >> > In which case what it really wants to say is "before nfs mounts" (or
> >> > even "before nfs mounts of localhost"; and vice versa on shutdown).  I
> >> > can't tell if there's an easy way to say that.
> >> 
> >> I'd be happy with a difficult/complex way, if it was reliable.
> >> Could we write a systemd generator which parses /etc/fstab, determines
> >> all mount points which are loop-back NFS mounts (or even just any NFS
> >> mounts) and creates a drop-in for nfs-server which adds
> >>   Before=mount-point.mount
> >> for each /mount/point.
> >> 
> >> Could that be reliable?  I might try.
> >
> > Digging around... we've also got this callout from mount to start-statd,
> > can we use something like that to make loopback nfs mounts wait on nfs
> > server startup?
> 
> An nfs mount already waits for the server to start up.  The ordering
> dependency between NFS mounts and the nfs-server only really matters at
> shutdown, and we cannot enhance mount.nfs to wait for a negative amount
> of time (also known as "time travel")

D'oh, I keep forgetting that point.

--b.
diff mbox

Patch

diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ec86a22613cf..346a8b3af8b5 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -687,7 +687,6 @@  static void nfsd_fh(int f)
 	char *found_path = NULL;
 	nfs_export *exp;
 	int i;
-	int dev_missing = 0;
 	char buf[RPC_CHAN_BUF_SIZE], *bp;
 	int blen;
 
@@ -754,11 +753,6 @@  static void nfsd_fh(int f)
 			if (!is_ipaddr_client(dom)
 					&& !namelist_client_matches(exp, dom))
 				continue;
-			if (exp->m_export.e_mountpoint &&
-			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
-					   exp->m_export.e_mountpoint:
-					   exp->m_export.e_path))
-				dev_missing ++;
 
 			if (!match_fsid(&parsed, exp, path))
 				continue;
@@ -801,12 +795,6 @@  static void nfsd_fh(int f)
 		/* FIXME we need to make sure we re-visit this later */
 		goto out;
 	}
-	if (!found && dev_missing) {
-		/* The missing dev could be what we want, so just be
-		 * quite rather than returning stale yet
-		 */
-		goto out;
-	}
 
 	if (found)
 		if (cache_export_ent(buf, sizeof(buf), dom, found, found_path) < 0)