[rfc] nfs: propagate readlink errors in nfs_symlink_filler

Message ID	20240521125840.186618-1-sagi@grimberg.me (mailing list archive)
State	New
Headers	show Received: from mail-wr1-f42.google.com (mail-wr1-f42.google.com [209.85.221.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A3FF71E48B for <linux-nfs@vger.kernel.org>; Tue, 21 May 2024 12:58:45 +0000 (UTC) From: Sagi Grimberg <sagi@grimberg.me> To: linux-nfs@vger.kernel.org Cc: Jeff Layton <jlayton@kernel.org>, Chuck Lever <chuck.lever@oracle.com>, Dan Aloni <dan.aloni@vastdata.com>, Christoph Hellwig <hch@lst.de> Subject: [PATCH rfc] nfs: propagate readlink errors in nfs_symlink_filler Date: Tue, 21 May 2024 15:58:40 +0300 Message-Id: <20240521125840.186618-1-sagi@grimberg.me> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[rfc] nfs: propagate readlink errors in nfs_symlink_filler \| expand [rfc] nfs: propagate readlink errors in nfs_symlink_filler

Sagi Grimberg May 21, 2024, 12:58 p.m. UTC

There is an inherent race where a symlink file may have been overriden
(by a different client) between lookup and readlink, resulting in a
spurious EIO error returned to userspace. Fix this by propagating back
ESTALE errors such that the vfs will retry the lookup/get_link (similar
to nfs4_file_open) at least once.

Cc: Dan Aloni <dan.aloni@vastdata.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Note that with this change the vfs should retry once for
ESTALE errors. However with an artificial reproducer of high
frequency symlink overrides, nothing prevents the retry to
also encounter ESTALE, propagating the error back to userspace.
The man pages for openat/readlinkat do not list an ESTALE errno.

An alternative attempt (implemented by Dan) was a local retry loop
in nfs_get_link(), if this is an applicable approach, Dan can
share his patch instead.

 fs/nfs/symlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Jeff Layton May 21, 2024, 1:22 p.m. UTC | #1

On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
> There is an inherent race where a symlink file may have been overriden
> (by a different client) between lookup and readlink, resulting in a
> spurious EIO error returned to userspace. Fix this by propagating back
> ESTALE errors such that the vfs will retry the lookup/get_link (similar
> to nfs4_file_open) at least once.
> 
> Cc: Dan Aloni <dan.aloni@vastdata.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Note that with this change the vfs should retry once for
> ESTALE errors. However with an artificial reproducer of high
> frequency symlink overrides, nothing prevents the retry to
> also encounter ESTALE, propagating the error back to userspace.
> The man pages for openat/readlinkat do not list an ESTALE errno.
> 
> An alternative attempt (implemented by Dan) was a local retry loop
> in nfs_get_link(), if this is an applicable approach, Dan can
> share his patch instead.
> 
>  fs/nfs/symlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
> index 0e27a2e4e68b..13818129d268 100644
> --- a/fs/nfs/symlink.c
> +++ b/fs/nfs/symlink.c
> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio)
>  error:
>  	folio_set_error(folio);
>  	folio_unlock(folio);
> -	return -EIO;
> +	return error;
>  }
>  
>  static const char *nfs_get_link(struct dentry *dentry,

git blame seems to indicate that we've returned -EIO here since the
beginning of the git era (and likely long before that). I see no reason
for us to cloak the real error there though, especially with something
like an ESTALE error.

    Reviewed-by: Jeff Layton <jlayton@kernel.org>

FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond
what we already do.

Yes, we can sometimes trigger ESTALE errors to bubble up to userland if
we really thrash the underlying filesystem when testing, but I think
that's actually desirable:

If you have real workloads across multiple machines that are racing
with other that tightly, then you should probably be using some sort of
locking or other synchronization. If it's clever enough that it
doesn''t need that, then it should be able to deal with the occasional
ESTALE error by retrying on its own.

Cheers,
Jeff

Chuck Lever May 21, 2024, 2:02 p.m. UTC | #2

On Tue, May 21, 2024 at 03:58:40PM +0300, Sagi Grimberg wrote:
> There is an inherent race where a symlink file may have been overriden

Nit: Do you mean "overwritten" ?

> (by a different client) between lookup and readlink, resulting in a
> spurious EIO error returned to userspace. Fix this by propagating back
> ESTALE errors such that the vfs will retry the lookup/get_link (similar
> to nfs4_file_open) at least once.
> 
> Cc: Dan Aloni <dan.aloni@vastdata.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> Note that with this change the vfs should retry once for
> ESTALE errors. However with an artificial reproducer of high
> frequency symlink overrides, nothing prevents the retry to

Nit: "overwrites" ?

> also encounter ESTALE, propagating the error back to userspace.
> The man pages for openat/readlinkat do not list an ESTALE errno.

Speaking only as a community member, I consider that an undesirable
behavior regression. IMO it's a bug for a system call to return an
errno that isn't documented. That's likely why this logic has worked
this way for forever.

> An alternative attempt (implemented by Dan) was a local retry loop
> in nfs_get_link(), if this is an applicable approach, Dan can
> share his patch instead.

I'm not entirely convinced by your patch description that returning
an EIO on occasion is a problem. Is it reasonable for the app to
expect that readlinkat() will /never/ fail?

Making symlink semantics more atomic on NFS mounts is probably a
good goal. But IMO the proposed change by itself isn't going to get
you that with high reliability and few or no undesirable side
effects.

Note that NFS client-side patches should be sent To: Trond, Anna,
and Cc: linux-nfs@ . Trond and Anna need to weigh in on this.

>  fs/nfs/symlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
> index 0e27a2e4e68b..13818129d268 100644
> --- a/fs/nfs/symlink.c
> +++ b/fs/nfs/symlink.c
> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio)
>  error:
>  	folio_set_error(folio);
>  	folio_unlock(folio);
> -	return -EIO;
> +	return error;
>  }
>  
>  static const char *nfs_get_link(struct dentry *dentry,
> -- 
> 2.40.1
>

Dan Aloni May 21, 2024, 2:36 p.m. UTC | #3

On 2024-05-21 09:22:46, Jeff Layton wrote:
> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
> > There is an inherent race where a symlink file may have been overriden
> > (by a different client) between lookup and readlink, resulting in a
> > spurious EIO error returned to userspace. Fix this by propagating back
> > ESTALE errors such that the vfs will retry the lookup/get_link (similar
> > to nfs4_file_open) at least once.
[..]
> FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond
> what we already do.
> 
> Yes, we can sometimes trigger ESTALE errors to bubble up to userland if
> we really thrash the underlying filesystem when testing, but I think
> that's actually desirable:
> 
> If you have real workloads across multiple machines that are racing
> with other that tightly, then you should probably be using some sort of
> locking or other synchronization. If it's clever enough that it
> doesn''t need that, then it should be able to deal with the occasional
> ESTALE error by retrying on its own.

We saw an issue in a real workload employing the method I describe in
the following test case, where the user was surprised getting an error.

It's a symlink atomicity semantics case, where there's a symlink that is
frequently being overridden using a rename. This use case works well
with local file systems and with some other network file systems
implementations (this was also noticeable as other options where
tested).

There is fixed set of regular files `test_file_{1,2,3}`, and a 'shunt'
symlink that keeps getting recreated to one of them:

    while true; do
        i=1;
	while (( i <= 3 )) ; do
	    ln -s -f test_file_$i shunt
	    i=$((i + 1))
	done
    done

Behind the scenes `ln` creates a symlink with a random name, then
performs the `rename` system call to override `shunt`. When another
client is trying to call `open` via the symlink so it would necessarily
resolve to one of the regular files client-side. The previous FH of `shunt`
becomes stale and sometimes fails this test.

It is why we saw a retry loop being worthwhile to implement, whether
inside the NFS client or outside in the VFS layer, the justification for
it was to prevent the workload from breaking when moving between file
systems.

Sagi Grimberg May 21, 2024, 2:59 p.m. UTC | #4

On 21/05/2024 17:02, Chuck Lever wrote:
> On Tue, May 21, 2024 at 03:58:40PM +0300, Sagi Grimberg wrote:
>> There is an inherent race where a symlink file may have been overriden
> Nit: Do you mean "overwritten" ?

Yes.

>
>
>> (by a different client) between lookup and readlink, resulting in a
>> spurious EIO error returned to userspace. Fix this by propagating back
>> ESTALE errors such that the vfs will retry the lookup/get_link (similar
>> to nfs4_file_open) at least once.
>>
>> Cc: Dan Aloni <dan.aloni@vastdata.com>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> Note that with this change the vfs should retry once for
>> ESTALE errors. However with an artificial reproducer of high
>> frequency symlink overrides, nothing prevents the retry to
> Nit: "overwrites" ?

Yes.

>
>
>> also encounter ESTALE, propagating the error back to userspace.
>> The man pages for openat/readlinkat do not list an ESTALE errno.
> Speaking only as a community member, I consider that an undesirable
> behavior regression. IMO it's a bug for a system call to return an
> errno that isn't documented. That's likely why this logic has worked
> this way for forever.

Well, if this is an issue, it would be paired with a vfs change that
checks for the error-code of the retry and convert it back.
Something like:
--
diff --git a/fs/namei.c b/fs/namei.c
index ceb9ddf8dfdd..9a06408bb52f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3812,6 +3812,8 @@ static struct file *path_openat(struct nameidata *nd,
                 else
                         error = -ESTALE;
         }
+       if (error == -ESTALE && (flags & LOOKUP_REVAL))
+               error = -EIO;
         return ERR_PTR(error);
  }
--

But we'd need to check with Al for this type of a  change...

>
>
>> An alternative attempt (implemented by Dan) was a local retry loop
>> in nfs_get_link(), if this is an applicable approach, Dan can
>> share his patch instead.
> I'm not entirely convinced by your patch description that returning
> an EIO on occasion is a problem. Is it reasonable for the app to
> expect that readlinkat() will /never/ fail?

Maybe not never, but its fairly easy to encounter, and it was definitely
observed in the wild.

>
> Making symlink semantics more atomic on NFS mounts is probably a
> good goal. But IMO the proposed change by itself isn't going to get
> you that with high reliability and few or no undesirable side
> effects.

What undesirable effects?

>
> Note that NFS client-side patches should be sent To: Trond, Anna,
> and Cc: linux-nfs@ . Trond and Anna need to weigh in on this.

Yes, it was sent to linux-nfs so I expect Trond and Anna to see it, you
and Jeff were CC'd because we briefly discussed about this last week at
LSFMM.

Sagi Grimberg May 21, 2024, 3:05 p.m. UTC | #5

On 21/05/2024 16:22, Jeff Layton wrote:
> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
>> There is an inherent race where a symlink file may have been overriden
>> (by a different client) between lookup and readlink, resulting in a
>> spurious EIO error returned to userspace. Fix this by propagating back
>> ESTALE errors such that the vfs will retry the lookup/get_link (similar
>> to nfs4_file_open) at least once.
>>
>> Cc: Dan Aloni <dan.aloni@vastdata.com>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>> Note that with this change the vfs should retry once for
>> ESTALE errors. However with an artificial reproducer of high
>> frequency symlink overrides, nothing prevents the retry to
>> also encounter ESTALE, propagating the error back to userspace.
>> The man pages for openat/readlinkat do not list an ESTALE errno.
>>
>> An alternative attempt (implemented by Dan) was a local retry loop
>> in nfs_get_link(), if this is an applicable approach, Dan can
>> share his patch instead.
>>
>>   fs/nfs/symlink.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
>> index 0e27a2e4e68b..13818129d268 100644
>> --- a/fs/nfs/symlink.c
>> +++ b/fs/nfs/symlink.c
>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio)
>>   error:
>>   	folio_set_error(folio);
>>   	folio_unlock(folio);
>> -	return -EIO;
>> +	return error;
>>   }
>>   
>>   static const char *nfs_get_link(struct dentry *dentry,
> git blame seems to indicate that we've returned -EIO here since the
> beginning of the git era (and likely long before that). I see no reason
> for us to cloak the real error there though, especially with something
> like an ESTALE error.
>
>      Reviewed-by: Jeff Layton <jlayton@kernel.org>
>
> FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond
> what we already do.
>
> Yes, we can sometimes trigger ESTALE errors to bubble up to userland if
> we really thrash the underlying filesystem when testing, but I think
> that's actually desirable:

Returning ESTALE would be an improvement over returning EIO IMO,
but it may be surprising for userspace to see an undocumented errno.
Maybe the man pages can be amended?

>
> If you have real workloads across multiple machines that are racing
> with other that tightly, then you should probably be using some sort of
> locking or other synchronization. If it's clever enough that it
> doesn''t need that, then it should be able to deal with the occasional
> ESTALE error by retrying on its own.

I tend to agree. FWIW Solaris has a config knob for number of stale retries
it does, maybe there is an appetite to have something like that as well?

Trond Myklebust May 21, 2024, 3:13 p.m. UTC | #6

On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote:
> 
> 
> On 21/05/2024 16:22, Jeff Layton wrote:
> > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
> > > There is an inherent race where a symlink file may have been
> > > overriden
> > > (by a different client) between lookup and readlink, resulting in
> > > a
> > > spurious EIO error returned to userspace. Fix this by propagating
> > > back
> > > ESTALE errors such that the vfs will retry the lookup/get_link
> > > (similar
> > > to nfs4_file_open) at least once.
> > > 
> > > Cc: Dan Aloni <dan.aloni@vastdata.com>
> > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > ---
> > > Note that with this change the vfs should retry once for
> > > ESTALE errors. However with an artificial reproducer of high
> > > frequency symlink overrides, nothing prevents the retry to
> > > also encounter ESTALE, propagating the error back to userspace.
> > > The man pages for openat/readlinkat do not list an ESTALE errno.
> > > 
> > > An alternative attempt (implemented by Dan) was a local retry
> > > loop
> > > in nfs_get_link(), if this is an applicable approach, Dan can
> > > share his patch instead.
> > > 
> > >   fs/nfs/symlink.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
> > > index 0e27a2e4e68b..13818129d268 100644
> > > --- a/fs/nfs/symlink.c
> > > +++ b/fs/nfs/symlink.c
> > > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file
> > > *file, struct folio *folio)
> > >   error:
> > >   	folio_set_error(folio);
> > >   	folio_unlock(folio);
> > > -	return -EIO;
> > > +	return error;
> > >   }
> > >   
> > >   static const char *nfs_get_link(struct dentry *dentry,
> > git blame seems to indicate that we've returned -EIO here since the
> > beginning of the git era (and likely long before that). I see no
> > reason
> > for us to cloak the real error there though, especially with
> > something
> > like an ESTALE error.
> > 
> >      Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > 
> > FWIW, I think we shouldn't try to do any retry looping on ESTALE
> > beyond
> > what we already do.
> > 
> > Yes, we can sometimes trigger ESTALE errors to bubble up to
> > userland if
> > we really thrash the underlying filesystem when testing, but I
> > think
> > that's actually desirable:
> 
> Returning ESTALE would be an improvement over returning EIO IMO,
> but it may be surprising for userspace to see an undocumented errno.
> Maybe the man pages can be amended?
> 
> > 
> > If you have real workloads across multiple machines that are racing
> > with other that tightly, then you should probably be using some
> > sort of
> > locking or other synchronization. If it's clever enough that it
> > doesn''t need that, then it should be able to deal with the
> > occasional
> > ESTALE error by retrying on its own.
> 
> I tend to agree. FWIW Solaris has a config knob for number of stale
> retries
> it does, maybe there is an appetite to have something like that as
> well?
> 

Any reason why we couldn't just return ENOENT in the case where the
filehandle is stale? There will have been an unlink() on the symlink at
some point in the recent past.

Chuck Lever May 21, 2024, 3:24 p.m. UTC | #7

> On May 21, 2024, at 11:13 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote:
>> 
>> 
>> On 21/05/2024 16:22, Jeff Layton wrote:
>>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
>>>> There is an inherent race where a symlink file may have been
>>>> overriden
>>>> (by a different client) between lookup and readlink, resulting in
>>>> a
>>>> spurious EIO error returned to userspace. Fix this by propagating
>>>> back
>>>> ESTALE errors such that the vfs will retry the lookup/get_link
>>>> (similar
>>>> to nfs4_file_open) at least once.
>>>> 
>>>> Cc: Dan Aloni <dan.aloni@vastdata.com>
>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> Note that with this change the vfs should retry once for
>>>> ESTALE errors. However with an artificial reproducer of high
>>>> frequency symlink overrides, nothing prevents the retry to
>>>> also encounter ESTALE, propagating the error back to userspace.
>>>> The man pages for openat/readlinkat do not list an ESTALE errno.
>>>> 
>>>> An alternative attempt (implemented by Dan) was a local retry
>>>> loop
>>>> in nfs_get_link(), if this is an applicable approach, Dan can
>>>> share his patch instead.
>>>> 
>>>>   fs/nfs/symlink.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>> 
>>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
>>>> index 0e27a2e4e68b..13818129d268 100644
>>>> --- a/fs/nfs/symlink.c
>>>> +++ b/fs/nfs/symlink.c
>>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file
>>>> *file, struct folio *folio)
>>>>   error:
>>>>    folio_set_error(folio);
>>>>    folio_unlock(folio);
>>>> - return -EIO;
>>>> + return error;
>>>>   }
>>>>   
>>>>   static const char *nfs_get_link(struct dentry *dentry,
>>> git blame seems to indicate that we've returned -EIO here since the
>>> beginning of the git era (and likely long before that). I see no
>>> reason
>>> for us to cloak the real error there though, especially with
>>> something
>>> like an ESTALE error.
>>> 
>>>      Reviewed-by: Jeff Layton <jlayton@kernel.org>
>>> 
>>> FWIW, I think we shouldn't try to do any retry looping on ESTALE
>>> beyond
>>> what we already do.
>>> 
>>> Yes, we can sometimes trigger ESTALE errors to bubble up to
>>> userland if
>>> we really thrash the underlying filesystem when testing, but I
>>> think
>>> that's actually desirable:
>> 
>> Returning ESTALE would be an improvement over returning EIO IMO,
>> but it may be surprising for userspace to see an undocumented errno.
>> Maybe the man pages can be amended?
>> 
>>> 
>>> If you have real workloads across multiple machines that are racing
>>> with other that tightly, then you should probably be using some
>>> sort of
>>> locking or other synchronization. If it's clever enough that it
>>> doesn''t need that, then it should be able to deal with the
>>> occasional
>>> ESTALE error by retrying on its own.
>> 
>> I tend to agree. FWIW Solaris has a config knob for number of stale
>> retries
>> it does, maybe there is an appetite to have something like that as
>> well?
>> 
> 
> Any reason why we couldn't just return ENOENT in the case where the
> filehandle is stale? There will have been an unlink() on the symlink at
> some point in the recent past.

To me ENOENT is preferable to both EIO and ESTALE.


--
Chuck Lever

Sagi Grimberg May 21, 2024, 4:09 p.m. UTC | #8

On 21/05/2024 18:13, Trond Myklebust wrote:
> On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote:
>>
>> On 21/05/2024 16:22, Jeff Layton wrote:
>>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
>>>> There is an inherent race where a symlink file may have been
>>>> overriden
>>>> (by a different client) between lookup and readlink, resulting in
>>>> a
>>>> spurious EIO error returned to userspace. Fix this by propagating
>>>> back
>>>> ESTALE errors such that the vfs will retry the lookup/get_link
>>>> (similar
>>>> to nfs4_file_open) at least once.
>>>>
>>>> Cc: Dan Aloni <dan.aloni@vastdata.com>
>>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>>> ---
>>>> Note that with this change the vfs should retry once for
>>>> ESTALE errors. However with an artificial reproducer of high
>>>> frequency symlink overrides, nothing prevents the retry to
>>>> also encounter ESTALE, propagating the error back to userspace.
>>>> The man pages for openat/readlinkat do not list an ESTALE errno.
>>>>
>>>> An alternative attempt (implemented by Dan) was a local retry
>>>> loop
>>>> in nfs_get_link(), if this is an applicable approach, Dan can
>>>> share his patch instead.
>>>>
>>>>    fs/nfs/symlink.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
>>>> index 0e27a2e4e68b..13818129d268 100644
>>>> --- a/fs/nfs/symlink.c
>>>> +++ b/fs/nfs/symlink.c
>>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file
>>>> *file, struct folio *folio)
>>>>    error:
>>>>    	folio_set_error(folio);
>>>>    	folio_unlock(folio);
>>>> -	return -EIO;
>>>> +	return error;
>>>>    }
>>>>    
>>>>    static const char *nfs_get_link(struct dentry *dentry,
>>> git blame seems to indicate that we've returned -EIO here since the
>>> beginning of the git era (and likely long before that). I see no
>>> reason
>>> for us to cloak the real error there though, especially with
>>> something
>>> like an ESTALE error.
>>>
>>>       Reviewed-by: Jeff Layton <jlayton@kernel.org>
>>>
>>> FWIW, I think we shouldn't try to do any retry looping on ESTALE
>>> beyond
>>> what we already do.
>>>
>>> Yes, we can sometimes trigger ESTALE errors to bubble up to
>>> userland if
>>> we really thrash the underlying filesystem when testing, but I
>>> think
>>> that's actually desirable:
>> Returning ESTALE would be an improvement over returning EIO IMO,
>> but it may be surprising for userspace to see an undocumented errno.
>> Maybe the man pages can be amended?
>>
>>> If you have real workloads across multiple machines that are racing
>>> with other that tightly, then you should probably be using some
>>> sort of
>>> locking or other synchronization. If it's clever enough that it
>>> doesn''t need that, then it should be able to deal with the
>>> occasional
>>> ESTALE error by retrying on its own.
>> I tend to agree. FWIW Solaris has a config knob for number of stale
>> retries
>> it does, maybe there is an appetite to have something like that as
>> well?
>>
> Any reason why we couldn't just return ENOENT in the case where the
> filehandle is stale? There will have been an unlink() on the symlink at
> some point in the recent past.
>

No reason that I can see. However given that this was observed in the 
wild, and essentially
a common pattern with symlinks (overwrite a config file for example), I 
think its reasonable
to have the vfs at least do a single retry, by simply returning ESTALE.
However NFS cannot distinguish between first and second retries 
afaict... Perhaps the
vfs can help with a ESTALE->ENOENT conversion?

Dan Aloni May 22, 2024, 4:41 a.m. UTC | #9

On 2024-05-21 15:24:19, Chuck Lever III wrote:
> 
> 
> > On May 21, 2024, at 11:13 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> > 
> > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote:
> >> 
> >> 
> >> On 21/05/2024 16:22, Jeff Layton wrote:
> >>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
> >>>> There is an inherent race where a symlink file may have been
> >>>> overriden
> >>>> (by a different client) between lookup and readlink, resulting in
> >>>> a
> >>>> spurious EIO error returned to userspace. Fix this by propagating
> >>>> back
> >>>> ESTALE errors such that the vfs will retry the lookup/get_link
> >>>> (similar
> >>>> to nfs4_file_open) at least once.
> >>>> 
> >>>> Cc: Dan Aloni <dan.aloni@vastdata.com>
> >>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> >>>> ---
> >>>> Note that with this change the vfs should retry once for
> >>>> ESTALE errors. However with an artificial reproducer of high
> >>>> frequency symlink overrides, nothing prevents the retry to
> >>>> also encounter ESTALE, propagating the error back to userspace.
> >>>> The man pages for openat/readlinkat do not list an ESTALE errno.
> >>>> 
> >>>> An alternative attempt (implemented by Dan) was a local retry
> >>>> loop
> >>>> in nfs_get_link(), if this is an applicable approach, Dan can
> >>>> share his patch instead.
> >>>> 
> >>>>   fs/nfs/symlink.c | 2 +-
> >>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>> 
> >>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
> >>>> index 0e27a2e4e68b..13818129d268 100644
> >>>> --- a/fs/nfs/symlink.c
> >>>> +++ b/fs/nfs/symlink.c
> >>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file
> >>>> *file, struct folio *folio)
> >>>>   error:
> >>>>    folio_set_error(folio);
> >>>>    folio_unlock(folio);
> >>>> - return -EIO;
> >>>> + return error;
> >>>>   }
> >>>>   
> >>>>   static const char *nfs_get_link(struct dentry *dentry,
> >>> git blame seems to indicate that we've returned -EIO here since the
> >>> beginning of the git era (and likely long before that). I see no
> >>> reason
> >>> for us to cloak the real error there though, especially with
> >>> something
> >>> like an ESTALE error.
> >>> 
> >>>      Reviewed-by: Jeff Layton <jlayton@kernel.org>
> >>> 
> >>> FWIW, I think we shouldn't try to do any retry looping on ESTALE
> >>> beyond
> >>> what we already do.
> >>> 
> >>> Yes, we can sometimes trigger ESTALE errors to bubble up to
> >>> userland if
> >>> we really thrash the underlying filesystem when testing, but I
> >>> think
> >>> that's actually desirable:
> >> 
> >> Returning ESTALE would be an improvement over returning EIO IMO,
> >> but it may be surprising for userspace to see an undocumented errno.
> >> Maybe the man pages can be amended?
> >> 
> >>> 
> >>> If you have real workloads across multiple machines that are racing
> >>> with other that tightly, then you should probably be using some
> >>> sort of
> >>> locking or other synchronization. If it's clever enough that it
> >>> doesn''t need that, then it should be able to deal with the
> >>> occasional
> >>> ESTALE error by retrying on its own.
> >> 
> >> I tend to agree. FWIW Solaris has a config knob for number of stale
> >> retries
> >> it does, maybe there is an appetite to have something like that as
> >> well?
> >> 
> > 
> > Any reason why we couldn't just return ENOENT in the case where the
> > filehandle is stale? There will have been an unlink() on the symlink at
> > some point in the recent past.
> 
> To me ENOENT is preferable to both EIO and ESTALE.

Another view on that, where in the scenario of `rename` causing the
unlinking, there was no situation of 'no entry' as the directory entry
was only updated and not removed. So ENOENT in this regard by the
meaning of 'no entry' would not reflect what has really happened.

(unless you go with the 'no entity' interpretation of ENOENT, but that
would be against most of the POSIX-spec cases where ENOENT is returned
which deal primarily with missing path components i.e. names to
objects and not the objects themselves)

Trond Myklebust May 22, 2024, 12:11 p.m. UTC | #10

On Wed, 2024-05-22 at 07:41 +0300, Dan Aloni wrote:
> On 2024-05-21 15:24:19, Chuck Lever III wrote:
> > 
> > 
> > > On May 21, 2024, at 11:13 AM, Trond Myklebust
> > > <trondmy@hammerspace.com> wrote:
> > > 
> > > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote:
> > > > 
> > > > 
> > > > On 21/05/2024 16:22, Jeff Layton wrote:
> > > > > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote:
> > > > > > There is an inherent race where a symlink file may have
> > > > > > been
> > > > > > overriden
> > > > > > (by a different client) between lookup and readlink,
> > > > > > resulting in
> > > > > > a
> > > > > > spurious EIO error returned to userspace. Fix this by
> > > > > > propagating
> > > > > > back
> > > > > > ESTALE errors such that the vfs will retry the
> > > > > > lookup/get_link
> > > > > > (similar
> > > > > > to nfs4_file_open) at least once.
> > > > > > 
> > > > > > Cc: Dan Aloni <dan.aloni@vastdata.com>
> > > > > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> > > > > > ---
> > > > > > Note that with this change the vfs should retry once for
> > > > > > ESTALE errors. However with an artificial reproducer of
> > > > > > high
> > > > > > frequency symlink overrides, nothing prevents the retry to
> > > > > > also encounter ESTALE, propagating the error back to
> > > > > > userspace.
> > > > > > The man pages for openat/readlinkat do not list an ESTALE
> > > > > > errno.
> > > > > > 
> > > > > > An alternative attempt (implemented by Dan) was a local
> > > > > > retry
> > > > > > loop
> > > > > > in nfs_get_link(), if this is an applicable approach, Dan
> > > > > > can
> > > > > > share his patch instead.
> > > > > > 
> > > > > >   fs/nfs/symlink.c | 2 +-
> > > > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
> > > > > > index 0e27a2e4e68b..13818129d268 100644
> > > > > > --- a/fs/nfs/symlink.c
> > > > > > +++ b/fs/nfs/symlink.c
> > > > > > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file
> > > > > > *file, struct folio *folio)
> > > > > >   error:
> > > > > >    folio_set_error(folio);
> > > > > >    folio_unlock(folio);
> > > > > > - return -EIO;
> > > > > > + return error;
> > > > > >   }
> > > > > >   
> > > > > >   static const char *nfs_get_link(struct dentry *dentry,
> > > > > git blame seems to indicate that we've returned -EIO here
> > > > > since the
> > > > > beginning of the git era (and likely long before that). I see
> > > > > no
> > > > > reason
> > > > > for us to cloak the real error there though, especially with
> > > > > something
> > > > > like an ESTALE error.
> > > > > 
> > > > >      Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > > > 
> > > > > FWIW, I think we shouldn't try to do any retry looping on
> > > > > ESTALE
> > > > > beyond
> > > > > what we already do.
> > > > > 
> > > > > Yes, we can sometimes trigger ESTALE errors to bubble up to
> > > > > userland if
> > > > > we really thrash the underlying filesystem when testing, but
> > > > > I
> > > > > think
> > > > > that's actually desirable:
> > > > 
> > > > Returning ESTALE would be an improvement over returning EIO
> > > > IMO,
> > > > but it may be surprising for userspace to see an undocumented
> > > > errno.
> > > > Maybe the man pages can be amended?
> > > > 
> > > > > 
> > > > > If you have real workloads across multiple machines that are
> > > > > racing
> > > > > with other that tightly, then you should probably be using
> > > > > some
> > > > > sort of
> > > > > locking or other synchronization. If it's clever enough that
> > > > > it
> > > > > doesn''t need that, then it should be able to deal with the
> > > > > occasional
> > > > > ESTALE error by retrying on its own.
> > > > 
> > > > I tend to agree. FWIW Solaris has a config knob for number of
> > > > stale
> > > > retries
> > > > it does, maybe there is an appetite to have something like that
> > > > as
> > > > well?
> > > > 
> > > 
> > > Any reason why we couldn't just return ENOENT in the case where
> > > the
> > > filehandle is stale? There will have been an unlink() on the
> > > symlink at
> > > some point in the recent past.
> > 
> > To me ENOENT is preferable to both EIO and ESTALE.
> 
> Another view on that, where in the scenario of `rename` causing the
> unlinking, there was no situation of 'no entry' as the directory
> entry
> was only updated and not removed. So ENOENT in this regard by the
> meaning of 'no entry' would not reflect what has really happened.
> 
> (unless you go with the 'no entity' interpretation of ENOENT, but
> that
> would be against most of the POSIX-spec cases where ENOENT is
> returned
> which deal primarily with missing path components i.e. names to
> objects and not the objects themselves)
> 

The Linux NFS client doesn't support volatile filehandles.

Sagi Grimberg May 22, 2024, 7:40 p.m. UTC | #11

Hey Trond,

>> filehandle is stale? There will have been an unlink() on the symlink at
>> some point in the recent past.
>>
>
> No reason that I can see. However given that this was observed in the 
> wild, and essentially
> a common pattern with symlinks (overwrite a config file for example), 
> I think its reasonable
> to have the vfs at least do a single retry, by simply returning ESTALE.
> However NFS cannot distinguish between first and second retries 
> afaict... Perhaps the
> vfs can help with a ESTALE->ENOENT conversion?

So what do you suggest we do here? IMO at a minimum NFS should retry 
once similar
to nfs4_file_open (it would probably address 99.9% of the use-cases 
where symlinks are
not overwritten in a high enough frequency for the client to see 2 
consecutive stale readlink
rpc rplies).

I can send a patch paired with a vfs ESTALE conversion patch? 
alternatively retry locally in NFS...
I would like to understand your position here.

Trond Myklebust May 22, 2024, 9:04 p.m. UTC | #12

On Wed, 2024-05-22 at 22:40 +0300, Sagi Grimberg wrote:
> Hey Trond,
> 
> > > filehandle is stale? There will have been an unlink() on the
> > > symlink at
> > > some point in the recent past.
> > > 
> > 
> > No reason that I can see. However given that this was observed in
> > the 
> > wild, and essentially
> > a common pattern with symlinks (overwrite a config file for
> > example), 
> > I think its reasonable
> > to have the vfs at least do a single retry, by simply returning
> > ESTALE.
> > However NFS cannot distinguish between first and second retries 
> > afaict... Perhaps the
> > vfs can help with a ESTALE->ENOENT conversion?
> 
> So what do you suggest we do here? IMO at a minimum NFS should retry 
> once similar
> to nfs4_file_open (it would probably address 99.9% of the use-cases 
> where symlinks are
> not overwritten in a high enough frequency for the client to see 2 
> consecutive stale readlink
> rpc rplies).
> 
> I can send a patch paired with a vfs ESTALE conversion patch? 
> alternatively retry locally in NFS...
> I would like to understand your position here.
> 

Looking more closely at nfs_get_link(), it is obvious that it can
already return ESTALE (thanks to the call to nfs_revalidate_mapping())
and looking at do_readlinkat(), it has already been plumbed through
with a call to retry_estale().

So I think we can take your patch as is, since it doesn't add any error
cases that callers of readlink() don't have to handle already.

We might still want to think about cleaning up the output of the VFS in
all these cases, so that we don't return ESTALE when it isn't allowed
by POSIX, but that would be a separate task.

Sagi Grimberg May 22, 2024, 9:19 p.m. UTC | #13

>> So what do you suggest we do here? IMO at a minimum NFS should retry
>> once similar
>> to nfs4_file_open (it would probably address 99.9% of the use-cases
>> where symlinks are
>> not overwritten in a high enough frequency for the client to see 2
>> consecutive stale readlink
>> rpc rplies).
>>
>> I can send a patch paired with a vfs ESTALE conversion patch?
>> alternatively retry locally in NFS...
>> I would like to understand your position here.
>>
> Looking more closely at nfs_get_link(), it is obvious that it can
> already return ESTALE (thanks to the call to nfs_revalidate_mapping())
> and looking at do_readlinkat(), it has already been plumbed through
> with a call to retry_estale().
>
> So I think we can take your patch as is, since it doesn't add any error
> cases that callers of readlink() don't have to handle already.

Sounds good.

>
> We might still want to think about cleaning up the output of the VFS in
> all these cases, so that we don't return ESTALE when it isn't allowed
> by POSIX, but that would be a separate task.
>

Yes, I can follow up on that later...

Jeff Layton May 22, 2024, 9:20 p.m. UTC | #14

On Wed, 2024-05-22 at 21:04 +0000, Trond Myklebust wrote:
> On Wed, 2024-05-22 at 22:40 +0300, Sagi Grimberg wrote:
> > Hey Trond,
> > 
> > > > filehandle is stale? There will have been an unlink() on the
> > > > symlink at
> > > > some point in the recent past.
> > > > 
> > > 
> > > No reason that I can see. However given that this was observed in
> > > the 
> > > wild, and essentially
> > > a common pattern with symlinks (overwrite a config file for
> > > example), 
> > > I think its reasonable
> > > to have the vfs at least do a single retry, by simply returning
> > > ESTALE.
> > > However NFS cannot distinguish between first and second retries 
> > > afaict... Perhaps the
> > > vfs can help with a ESTALE->ENOENT conversion?
> > 
> > So what do you suggest we do here? IMO at a minimum NFS should retry 
> > once similar
> > to nfs4_file_open (it would probably address 99.9% of the use-cases 
> > where symlinks are
> > not overwritten in a high enough frequency for the client to see 2 
> > consecutive stale readlink
> > rpc rplies).
> > 
> > I can send a patch paired with a vfs ESTALE conversion patch? 
> > alternatively retry locally in NFS...
> > I would like to understand your position here.
> > 
> 
> Looking more closely at nfs_get_link(), it is obvious that it can
> already return ESTALE (thanks to the call to nfs_revalidate_mapping())
> and looking at do_readlinkat(), it has already been plumbed through
> with a call to retry_estale().
> 
> So I think we can take your patch as is, since it doesn't add any error
> cases that callers of readlink() don't have to handle already.
> 
> We might still want to think about cleaning up the output of the VFS in
> all these cases, so that we don't return ESTALE when it isn't allowed
> by POSIX, but that would be a separate task.
> 

I think we can effectively turn ESTALE into ENOENT in most (all?)
syscalls that take a pathname, since you can argue that it would have
been an ENOENT had we gotten in there just a little later.

To fix this the right way, I think you'd have to plumb this translation
into most path-based syscall handlers at a fairly high level. Maybe we
need some sort of generic sanitize_errno() handler that we call from
these sorts of calls?

In any case, I think that's a somewhat larger project. :)

Sagi Grimberg May 30, 2024, 6:08 p.m. UTC | #15

On 23/05/2024 0:19, Sagi Grimberg wrote:
>
>>> So what do you suggest we do here? IMO at a minimum NFS should retry
>>> once similar
>>> to nfs4_file_open (it would probably address 99.9% of the use-cases
>>> where symlinks are
>>> not overwritten in a high enough frequency for the client to see 2
>>> consecutive stale readlink
>>> rpc rplies).
>>>
>>> I can send a patch paired with a vfs ESTALE conversion patch?
>>> alternatively retry locally in NFS...
>>> I would like to understand your position here.
>>>
>> Looking more closely at nfs_get_link(), it is obvious that it can
>> already return ESTALE (thanks to the call to nfs_revalidate_mapping())
>> and looking at do_readlinkat(), it has already been plumbed through
>> with a call to retry_estale().
>>
>> So I think we can take your patch as is, since it doesn't add any error
>> cases that callers of readlink() don't have to handle already.
>
> Sounds good.
>
>>
>> We might still want to think about cleaning up the output of the VFS in
>> all these cases, so that we don't return ESTALE when it isn't allowed
>> by POSIX, but that would be a separate task.
>>
>
> Yes, I can follow up on that later...
>

Hey Trond,
is there anything else you are expecting to see before this is taken to 
your tree?

Trond Myklebust May 30, 2024, 6:21 p.m. UTC | #16

On Thu, 2024-05-30 at 21:08 +0300, Sagi Grimberg wrote:
> 
> 
> On 23/05/2024 0:19, Sagi Grimberg wrote:
> > 
> > > > So what do you suggest we do here? IMO at a minimum NFS should
> > > > retry
> > > > once similar
> > > > to nfs4_file_open (it would probably address 99.9% of the use-
> > > > cases
> > > > where symlinks are
> > > > not overwritten in a high enough frequency for the client to
> > > > see 2
> > > > consecutive stale readlink
> > > > rpc rplies).
> > > > 
> > > > I can send a patch paired with a vfs ESTALE conversion patch?
> > > > alternatively retry locally in NFS...
> > > > I would like to understand your position here.
> > > > 
> > > Looking more closely at nfs_get_link(), it is obvious that it can
> > > already return ESTALE (thanks to the call to
> > > nfs_revalidate_mapping())
> > > and looking at do_readlinkat(), it has already been plumbed
> > > through
> > > with a call to retry_estale().
> > > 
> > > So I think we can take your patch as is, since it doesn't add any
> > > error
> > > cases that callers of readlink() don't have to handle already.
> > 
> > Sounds good.
> > 
> > > 
> > > We might still want to think about cleaning up the output of the
> > > VFS in
> > > all these cases, so that we don't return ESTALE when it isn't
> > > allowed
> > > by POSIX, but that would be a separate task.
> > > 
> > 
> > Yes, I can follow up on that later...
> > 
> 
> Hey Trond,
> is there anything else you are expecting to see before this is taken
> to 
> your tree?
> 

It's already queued in my testing branch:
https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing
I'll probably push that into the linux-next branch over the weekend.

Sagi Grimberg May 31, 2024, 4:24 a.m. UTC | #17

>> Hey Trond,
>> is there anything else you are expecting to see before this is taken
>> to
>> your tree?
>>
> It's already queued in my testing branch:
> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing
> I'll probably push that into the linux-next branch over the weekend.
>

Ah, I was looking at your linux-next.

Thanks

[rfc] nfs: propagate readlink errors in nfs_symlink_filler

Commit Message

Comments

Patch