Message ID | 20240521125840.186618-1-sagi@grimberg.me (mailing list archive)
---|---
State | New
Series | [rfc] nfs: propagate readlink errors in nfs_symlink_filler
On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: > There is an inherent race where a symlink file may have been overriden > (by a different client) between lookup and readlink, resulting in a > spurious EIO error returned to userspace. Fix this by propagating back > ESTALE errors such that the vfs will retry the lookup/get_link (similar > to nfs4_file_open) at least once. > > Cc: Dan Aloni <dan.aloni@vastdata.com> > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > --- > Note that with this change the vfs should retry once for > ESTALE errors. However with an artificial reproducer of high > frequency symlink overrides, nothing prevents the retry to > also encounter ESTALE, propagating the error back to userspace. > The man pages for openat/readlinkat do not list an ESTALE errno. > > An alternative attempt (implemented by Dan) was a local retry loop > in nfs_get_link(), if this is an applicable approach, Dan can > share his patch instead. > > fs/nfs/symlink.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c > index 0e27a2e4e68b..13818129d268 100644 > --- a/fs/nfs/symlink.c > +++ b/fs/nfs/symlink.c > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio) > error: > folio_set_error(folio); > folio_unlock(folio); > - return -EIO; > + return error; > } > > static const char *nfs_get_link(struct dentry *dentry, git blame seems to indicate that we've returned -EIO here since the beginning of the git era (and likely long before that). I see no reason for us to cloak the real error there though, especially with something like an ESTALE error. Reviewed-by: Jeff Layton <jlayton@kernel.org> FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond what we already do. 
Yes, we can sometimes trigger ESTALE errors to bubble up to userland if we really thrash the underlying filesystem when testing, but I think that's actually desirable:

If you have real workloads across multiple machines that are racing with each other that tightly, then you should probably be using some sort of locking or other synchronization. If it's clever enough that it doesn't need that, then it should be able to deal with the occasional ESTALE error by retrying on its own.

Cheers,
Jeff
On Tue, May 21, 2024 at 03:58:40PM +0300, Sagi Grimberg wrote: > There is an inherent race where a symlink file may have been overriden Nit: Do you mean "overwritten" ? > (by a different client) between lookup and readlink, resulting in a > spurious EIO error returned to userspace. Fix this by propagating back > ESTALE errors such that the vfs will retry the lookup/get_link (similar > to nfs4_file_open) at least once. > > Cc: Dan Aloni <dan.aloni@vastdata.com> > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > --- > Note that with this change the vfs should retry once for > ESTALE errors. However with an artificial reproducer of high > frequency symlink overrides, nothing prevents the retry to Nit: "overwrites" ? > also encounter ESTALE, propagating the error back to userspace. > The man pages for openat/readlinkat do not list an ESTALE errno. Speaking only as a community member, I consider that an undesirable behavior regression. IMO it's a bug for a system call to return an errno that isn't documented. That's likely why this logic has worked this way for forever. > An alternative attempt (implemented by Dan) was a local retry loop > in nfs_get_link(), if this is an applicable approach, Dan can > share his patch instead. I'm not entirely convinced by your patch description that returning an EIO on occasion is a problem. Is it reasonable for the app to expect that readlinkat() will /never/ fail? Making symlink semantics more atomic on NFS mounts is probably a good goal. But IMO the proposed change by itself isn't going to get you that with high reliability and few or no undesirable side effects. Note that NFS client-side patches should be sent To: Trond, Anna, and Cc: linux-nfs@ . Trond and Anna need to weigh in on this. 
> fs/nfs/symlink.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c > index 0e27a2e4e68b..13818129d268 100644 > --- a/fs/nfs/symlink.c > +++ b/fs/nfs/symlink.c > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio) > error: > folio_set_error(folio); > folio_unlock(folio); > - return -EIO; > + return error; > } > > static const char *nfs_get_link(struct dentry *dentry, > -- > 2.40.1 >
On 2024-05-21 09:22:46, Jeff Layton wrote: > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: > > There is an inherent race where a symlink file may have been overriden > > (by a different client) between lookup and readlink, resulting in a > > spurious EIO error returned to userspace. Fix this by propagating back > > ESTALE errors such that the vfs will retry the lookup/get_link (similar > > to nfs4_file_open) at least once. [..] > FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond > what we already do. > > Yes, we can sometimes trigger ESTALE errors to bubble up to userland if > we really thrash the underlying filesystem when testing, but I think > that's actually desirable: > > If you have real workloads across multiple machines that are racing > with other that tightly, then you should probably be using some sort of > locking or other synchronization. If it's clever enough that it > doesn't need that, then it should be able to deal with the occasional > ESTALE error by retrying on its own. We saw an issue in a real workload employing the method I describe in the following test case, where the user was surprised to get an error. It's a symlink atomicity semantics case, where there's a symlink that is frequently being overridden using a rename. This use case works well with local file systems and with some other network file system implementations (this was also noticeable as other options were tested). There is a fixed set of regular files `test_file_{1,2,3}`, and a 'shunt' symlink that keeps getting recreated to point at one of them:

    while true; do
        i=1
        while (( i <= 3 )); do
            ln -s -f test_file_$i shunt
            i=$((i + 1))
        done
    done

Behind the scenes `ln` creates a symlink with a random name, then performs the `rename` system call to overwrite `shunt`. When another client calls `open` via the symlink, it should necessarily resolve to one of the regular files client-side.
The previous FH of `shunt` becomes stale and sometimes fails this test. That is why we saw a retry loop as worthwhile to implement, whether inside the NFS client or outside in the VFS layer; the justification for it was to prevent the workload from breaking when moving between file systems.
On 21/05/2024 17:02, Chuck Lever wrote: > On Tue, May 21, 2024 at 03:58:40PM +0300, Sagi Grimberg wrote: >> There is an inherent race where a symlink file may have been overriden > Nit: Do you mean "overwritten" ? Yes. > > >> (by a different client) between lookup and readlink, resulting in a >> spurious EIO error returned to userspace. Fix this by propagating back >> ESTALE errors such that the vfs will retry the lookup/get_link (similar >> to nfs4_file_open) at least once. >> >> Cc: Dan Aloni <dan.aloni@vastdata.com> >> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> >> --- >> Note that with this change the vfs should retry once for >> ESTALE errors. However with an artificial reproducer of high >> frequency symlink overrides, nothing prevents the retry to > Nit: "overwrites" ? Yes. > > >> also encounter ESTALE, propagating the error back to userspace. >> The man pages for openat/readlinkat do not list an ESTALE errno. > Speaking only as a community member, I consider that an undesirable > behavior regression. IMO it's a bug for a system call to return an > errno that isn't documented. That's likely why this logic has worked > this way for forever. Well, if this is an issue, it would be paired with a vfs change that checks the error code of the retry and converts it back. Something like:
--
diff --git a/fs/namei.c b/fs/namei.c
index ceb9ddf8dfdd..9a06408bb52f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3812,6 +3812,8 @@ static struct file *path_openat(struct nameidata *nd,
 		else
 			error = -ESTALE;
 	}
+	if (error == -ESTALE && (flags & LOOKUP_REVAL))
+		error = -EIO;
 	return ERR_PTR(error);
 }
--
But we'd need to check with Al for this type of a change... > > >> An alternative attempt (implemented by Dan) was a local retry loop >> in nfs_get_link(), if this is an applicable approach, Dan can >> share his patch instead. > I'm not entirely convinced by your patch description that returning > an EIO on occasion is a problem.
Is it reasonable for the app to > expect that readlinkat() will /never/ fail? Maybe not never, but it's fairly easy to encounter, and it was definitely observed in the wild. > > Making symlink semantics more atomic on NFS mounts is probably a > good goal. But IMO the proposed change by itself isn't going to get > you that with high reliability and few or no undesirable side > effects. What undesirable effects? > > Note that NFS client-side patches should be sent To: Trond, Anna, > and Cc: linux-nfs@ . Trond and Anna need to weigh in on this. Yes, it was sent to linux-nfs, so I expect Trond and Anna to see it; you and Jeff were CC'd because we briefly discussed this last week at LSFMM.
On 21/05/2024 16:22, Jeff Layton wrote: > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: >> There is an inherent race where a symlink file may have been overriden >> (by a different client) between lookup and readlink, resulting in a >> spurious EIO error returned to userspace. Fix this by propagating back >> ESTALE errors such that the vfs will retry the lookup/get_link (similar >> to nfs4_file_open) at least once. >> >> Cc: Dan Aloni <dan.aloni@vastdata.com> >> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> >> --- >> Note that with this change the vfs should retry once for >> ESTALE errors. However with an artificial reproducer of high >> frequency symlink overrides, nothing prevents the retry to >> also encounter ESTALE, propagating the error back to userspace. >> The man pages for openat/readlinkat do not list an ESTALE errno. >> >> An alternative attempt (implemented by Dan) was a local retry loop >> in nfs_get_link(), if this is an applicable approach, Dan can >> share his patch instead. >> >> fs/nfs/symlink.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c >> index 0e27a2e4e68b..13818129d268 100644 >> --- a/fs/nfs/symlink.c >> +++ b/fs/nfs/symlink.c >> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio) >> error: >> folio_set_error(folio); >> folio_unlock(folio); >> - return -EIO; >> + return error; >> } >> >> static const char *nfs_get_link(struct dentry *dentry, > git blame seems to indicate that we've returned -EIO here since the > beginning of the git era (and likely long before that). I see no reason > for us to cloak the real error there though, especially with something > like an ESTALE error. > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > FWIW, I think we shouldn't try to do any retry looping on ESTALE beyond > what we already do. 
> > Yes, we can sometimes trigger ESTALE errors to bubble up to userland if > we really thrash the underlying filesystem when testing, but I think > that's actually desirable: Returning ESTALE would be an improvement over returning EIO IMO, but it may be surprising for userspace to see an undocumented errno. Maybe the man pages can be amended? > > If you have real workloads across multiple machines that are racing > with other that tightly, then you should probably be using some sort of > locking or other synchronization. If it's clever enough that it > doesn't need that, then it should be able to deal with the occasional > ESTALE error by retrying on its own. I tend to agree. FWIW Solaris has a config knob for the number of stale retries it does; maybe there is an appetite to have something like that as well?
On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote: > > > On 21/05/2024 16:22, Jeff Layton wrote: > > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: > > > There is an inherent race where a symlink file may have been > > > overriden > > > (by a different client) between lookup and readlink, resulting in > > > a > > > spurious EIO error returned to userspace. Fix this by propagating > > > back > > > ESTALE errors such that the vfs will retry the lookup/get_link > > > (similar > > > to nfs4_file_open) at least once. > > > > > > Cc: Dan Aloni <dan.aloni@vastdata.com> > > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > > > --- > > > Note that with this change the vfs should retry once for > > > ESTALE errors. However with an artificial reproducer of high > > > frequency symlink overrides, nothing prevents the retry to > > > also encounter ESTALE, propagating the error back to userspace. > > > The man pages for openat/readlinkat do not list an ESTALE errno. > > > > > > An alternative attempt (implemented by Dan) was a local retry > > > loop > > > in nfs_get_link(), if this is an applicable approach, Dan can > > > share his patch instead. > > > > > > fs/nfs/symlink.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c > > > index 0e27a2e4e68b..13818129d268 100644 > > > --- a/fs/nfs/symlink.c > > > +++ b/fs/nfs/symlink.c > > > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file > > > *file, struct folio *folio) > > > error: > > > folio_set_error(folio); > > > folio_unlock(folio); > > > - return -EIO; > > > + return error; > > > } > > > > > > static const char *nfs_get_link(struct dentry *dentry, > > git blame seems to indicate that we've returned -EIO here since the > > beginning of the git era (and likely long before that). I see no > > reason > > for us to cloak the real error there though, especially with > > something > > like an ESTALE error. 
> > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > FWIW, I think we shouldn't try to do any retry looping on ESTALE > > beyond > > what we already do. > > > > Yes, we can sometimes trigger ESTALE errors to bubble up to > > userland if > > we really thrash the underlying filesystem when testing, but I > > think > > that's actually desirable: > > Returning ESTALE would be an improvement over returning EIO IMO, > but it may be surprising for userspace to see an undocumented errno. > Maybe the man pages can be amended? > > > > If you have real workloads across multiple machines that are racing > > with other that tightly, then you should probably be using some > > sort of > > locking or other synchronization. If it's clever enough that it > > doesn't need that, then it should be able to deal with the > > occasional > > ESTALE error by retrying on its own. > > I tend to agree. FWIW Solaris has a config knob for number of stale > retries > it does, maybe there is an appetite to have something like that as > well? > Any reason why we couldn't just return ENOENT in the case where the filehandle is stale? There will have been an unlink() on the symlink at some point in the recent past.
> On May 21, 2024, at 11:13 AM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote: >> >> >> On 21/05/2024 16:22, Jeff Layton wrote: >>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: >>>> There is an inherent race where a symlink file may have been >>>> overriden >>>> (by a different client) between lookup and readlink, resulting in >>>> a >>>> spurious EIO error returned to userspace. Fix this by propagating >>>> back >>>> ESTALE errors such that the vfs will retry the lookup/get_link >>>> (similar >>>> to nfs4_file_open) at least once. >>>> >>>> Cc: Dan Aloni <dan.aloni@vastdata.com> >>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> >>>> --- >>>> Note that with this change the vfs should retry once for >>>> ESTALE errors. However with an artificial reproducer of high >>>> frequency symlink overrides, nothing prevents the retry to >>>> also encounter ESTALE, propagating the error back to userspace. >>>> The man pages for openat/readlinkat do not list an ESTALE errno. >>>> >>>> An alternative attempt (implemented by Dan) was a local retry >>>> loop >>>> in nfs_get_link(), if this is an applicable approach, Dan can >>>> share his patch instead. >>>> >>>> fs/nfs/symlink.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c >>>> index 0e27a2e4e68b..13818129d268 100644 >>>> --- a/fs/nfs/symlink.c >>>> +++ b/fs/nfs/symlink.c >>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file >>>> *file, struct folio *folio) >>>> error: >>>> folio_set_error(folio); >>>> folio_unlock(folio); >>>> - return -EIO; >>>> + return error; >>>> } >>>> >>>> static const char *nfs_get_link(struct dentry *dentry, >>> git blame seems to indicate that we've returned -EIO here since the >>> beginning of the git era (and likely long before that). 
I see no >>> reason >>> for us to cloak the real error there though, especially with >>> something >>> like an ESTALE error. >>> >>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>> >>> FWIW, I think we shouldn't try to do any retry looping on ESTALE >>> beyond >>> what we already do. >>> >>> Yes, we can sometimes trigger ESTALE errors to bubble up to >>> userland if >>> we really thrash the underlying filesystem when testing, but I >>> think >>> that's actually desirable: >> >> Returning ESTALE would be an improvement over returning EIO IMO, >> but it may be surprising for userspace to see an undocumented errno. >> Maybe the man pages can be amended? >> >>> >>> If you have real workloads across multiple machines that are racing >>> with other that tightly, then you should probably be using some >>> sort of >>> locking or other synchronization. If it's clever enough that it >>> doesn't need that, then it should be able to deal with the >>> occasional >>> ESTALE error by retrying on its own. >> >> I tend to agree. FWIW Solaris has a config knob for number of stale >> retries >> it does, maybe there is an appetite to have something like that as >> well? >> > > Any reason why we couldn't just return ENOENT in the case where the > filehandle is stale? There will have been an unlink() on the symlink at > some point in the recent past. To me ENOENT is preferable to both EIO and ESTALE. -- Chuck Lever
On 21/05/2024 18:13, Trond Myklebust wrote: > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote: >> >> On 21/05/2024 16:22, Jeff Layton wrote: >>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: >>>> There is an inherent race where a symlink file may have been >>>> overriden >>>> (by a different client) between lookup and readlink, resulting in >>>> a >>>> spurious EIO error returned to userspace. Fix this by propagating >>>> back >>>> ESTALE errors such that the vfs will retry the lookup/get_link >>>> (similar >>>> to nfs4_file_open) at least once. >>>> >>>> Cc: Dan Aloni <dan.aloni@vastdata.com> >>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> >>>> --- >>>> Note that with this change the vfs should retry once for >>>> ESTALE errors. However with an artificial reproducer of high >>>> frequency symlink overrides, nothing prevents the retry to >>>> also encounter ESTALE, propagating the error back to userspace. >>>> The man pages for openat/readlinkat do not list an ESTALE errno. >>>> >>>> An alternative attempt (implemented by Dan) was a local retry >>>> loop >>>> in nfs_get_link(), if this is an applicable approach, Dan can >>>> share his patch instead. >>>> >>>> fs/nfs/symlink.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c >>>> index 0e27a2e4e68b..13818129d268 100644 >>>> --- a/fs/nfs/symlink.c >>>> +++ b/fs/nfs/symlink.c >>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file >>>> *file, struct folio *folio) >>>> error: >>>> folio_set_error(folio); >>>> folio_unlock(folio); >>>> - return -EIO; >>>> + return error; >>>> } >>>> >>>> static const char *nfs_get_link(struct dentry *dentry, >>> git blame seems to indicate that we've returned -EIO here since the >>> beginning of the git era (and likely long before that). I see no >>> reason >>> for us to cloak the real error there though, especially with >>> something >>> like an ESTALE error. 
>>> >>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>> >>> FWIW, I think we shouldn't try to do any retry looping on ESTALE >>> beyond >>> what we already do. >>> >>> Yes, we can sometimes trigger ESTALE errors to bubble up to >>> userland if >>> we really thrash the underlying filesystem when testing, but I >>> think >>> that's actually desirable: >> Returning ESTALE would be an improvement over returning EIO IMO, >> but it may be surprising for userspace to see an undocumented errno. >> Maybe the man pages can be amended? >> >>> If you have real workloads across multiple machines that are racing >>> with other that tightly, then you should probably be using some >>> sort of >>> locking or other synchronization. If it's clever enough that it >>> doesn't need that, then it should be able to deal with the >>> occasional >>> ESTALE error by retrying on its own. >> I tend to agree. FWIW Solaris has a config knob for number of stale >> retries >> it does, maybe there is an appetite to have something like that as >> well? >> > Any reason why we couldn't just return ENOENT in the case where the > filehandle is stale? There will have been an unlink() on the symlink at > some point in the recent past. > No reason that I can see. However given that this was observed in the wild, and essentially a common pattern with symlinks (overwrite a config file for example), I think it's reasonable to have the vfs at least do a single retry, by simply returning ESTALE. However NFS cannot distinguish between first and second retries afaict... Perhaps the vfs can help with an ESTALE->ENOENT conversion?
On 2024-05-21 15:24:19, Chuck Lever III wrote: > > > > On May 21, 2024, at 11:13 AM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > > > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote: > >> > >> > >> On 21/05/2024 16:22, Jeff Layton wrote: > >>> On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: > >>>> There is an inherent race where a symlink file may have been > >>>> overriden > >>>> (by a different client) between lookup and readlink, resulting in > >>>> a > >>>> spurious EIO error returned to userspace. Fix this by propagating > >>>> back > >>>> ESTALE errors such that the vfs will retry the lookup/get_link > >>>> (similar > >>>> to nfs4_file_open) at least once. > >>>> > >>>> Cc: Dan Aloni <dan.aloni@vastdata.com> > >>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > >>>> --- > >>>> Note that with this change the vfs should retry once for > >>>> ESTALE errors. However with an artificial reproducer of high > >>>> frequency symlink overrides, nothing prevents the retry to > >>>> also encounter ESTALE, propagating the error back to userspace. > >>>> The man pages for openat/readlinkat do not list an ESTALE errno. > >>>> > >>>> An alternative attempt (implemented by Dan) was a local retry > >>>> loop > >>>> in nfs_get_link(), if this is an applicable approach, Dan can > >>>> share his patch instead. 
> >>>> > >>>> fs/nfs/symlink.c | 2 +- > >>>> 1 file changed, 1 insertion(+), 1 deletion(-) > >>>> > >>>> diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c > >>>> index 0e27a2e4e68b..13818129d268 100644 > >>>> --- a/fs/nfs/symlink.c > >>>> +++ b/fs/nfs/symlink.c > >>>> @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file > >>>> *file, struct folio *folio) > >>>> error: > >>>> folio_set_error(folio); > >>>> folio_unlock(folio); > >>>> - return -EIO; > >>>> + return error; > >>>> } > >>>> > >>>> static const char *nfs_get_link(struct dentry *dentry, > >>> git blame seems to indicate that we've returned -EIO here since the > >>> beginning of the git era (and likely long before that). I see no > >>> reason > >>> for us to cloak the real error there though, especially with > >>> something > >>> like an ESTALE error. > >>> > >>> Reviewed-by: Jeff Layton <jlayton@kernel.org> > >>> > >>> FWIW, I think we shouldn't try to do any retry looping on ESTALE > >>> beyond > >>> what we already do. > >>> > >>> Yes, we can sometimes trigger ESTALE errors to bubble up to > >>> userland if > >>> we really thrash the underlying filesystem when testing, but > >>> I > >>> think > >>> that's actually desirable: > >> > >> Returning ESTALE would be an improvement over returning EIO IMO, > >> but it may be surprising for userspace to see an undocumented errno. > >> Maybe the man pages can be amended? > >> > >>> > >>> If you have real workloads across multiple machines that are racing > >>> with other that tightly, then you should probably be using some > >>> sort of > >>> locking or other synchronization. If it's clever enough that it > >>> doesn't need that, then it should be able to deal with the > >>> occasional > >>> ESTALE error by retrying on its own. > >> > >> I tend to agree. FWIW Solaris has a config knob for number of stale > >> retries > >> it does, maybe there is an appetite to have something like that as > >> well?
> >> > > > > Any reason why we couldn't just return ENOENT in the case where the > > filehandle is stale? There will have been an unlink() on the symlink at > > some point in the recent past. > > To me ENOENT is preferable to both EIO and ESTALE. Another view on that: in the scenario where `rename` causes the unlinking, there was never a 'no entry' situation, as the directory entry was only updated and not removed. So ENOENT, in the sense of 'no entry', would not reflect what has really happened. (Unless you go with the 'no entity' interpretation of ENOENT, but that would be against most of the POSIX-spec cases where ENOENT is returned, which deal primarily with missing path components, i.e. names to objects and not the objects themselves.)
On Wed, 2024-05-22 at 07:41 +0300, Dan Aloni wrote: > On 2024-05-21 15:24:19, Chuck Lever III wrote: > > > > > > > On May 21, 2024, at 11:13 AM, Trond Myklebust > > > <trondmy@hammerspace.com> wrote: > > > > > > On Tue, 2024-05-21 at 18:05 +0300, Sagi Grimberg wrote: > > > > > > > > > > > > On 21/05/2024 16:22, Jeff Layton wrote: > > > > > On Tue, 2024-05-21 at 15:58 +0300, Sagi Grimberg wrote: > > > > > > There is an inherent race where a symlink file may have > > > > > > been > > > > > > overriden > > > > > > (by a different client) between lookup and readlink, > > > > > > resulting in > > > > > > a > > > > > > spurious EIO error returned to userspace. Fix this by > > > > > > propagating > > > > > > back > > > > > > ESTALE errors such that the vfs will retry the > > > > > > lookup/get_link > > > > > > (similar > > > > > > to nfs4_file_open) at least once. > > > > > > > > > > > > Cc: Dan Aloni <dan.aloni@vastdata.com> > > > > > > Signed-off-by: Sagi Grimberg <sagi@grimberg.me> > > > > > > --- > > > > > > Note that with this change the vfs should retry once for > > > > > > ESTALE errors. However with an artificial reproducer of > > > > > > high > > > > > > frequency symlink overrides, nothing prevents the retry to > > > > > > also encounter ESTALE, propagating the error back to > > > > > > userspace. > > > > > > The man pages for openat/readlinkat do not list an ESTALE > > > > > > errno. > > > > > > > > > > > > An alternative attempt (implemented by Dan) was a local > > > > > > retry > > > > > > loop > > > > > > in nfs_get_link(), if this is an applicable approach, Dan > > > > > > can > > > > > > share his patch instead. 
> > > > > > > > > > > > fs/nfs/symlink.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c > > > > > > index 0e27a2e4e68b..13818129d268 100644 > > > > > > --- a/fs/nfs/symlink.c > > > > > > +++ b/fs/nfs/symlink.c > > > > > > @@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file > > > > > > *file, struct folio *folio) > > > > > > error: > > > > > > folio_set_error(folio); > > > > > > folio_unlock(folio); > > > > > > - return -EIO; > > > > > > + return error; > > > > > > } > > > > > > > > > > > > static const char *nfs_get_link(struct dentry *dentry, > > > > > git blame seems to indicate that we've returned -EIO here > > > > > since the > > > > > beginning of the git era (and likely long before that). I see > > > > > no > > > > > reason > > > > > for us to cloak the real error there though, especially with > > > > > something > > > > > like an ESTALE error. > > > > > > > > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > > FWIW, I think we shouldn't try to do any retry looping on > > > > > ESTALE > > > > > beyond > > > > > what we already do. > > > > > > > > > > Yes, we can sometimes trigger ESTALE errors to bubble up to > > > > > userland if > > > > > we really thrash the underlying filesystem when testing, but > > > > > I > > > > > think > > > > > that's actually desirable: > > > > > > > > Returning ESTALE would be an improvement over returning EIO > > > > IMO, > > > > but it may be surprising for userspace to see an undocumented > > > > errno. > > > > Maybe the man pages can be amended? > > > > > > > > > > > > > > If you have real workloads across multiple machines that are > > > > > racing > > > > > with other that tightly, then you should probably be using > > > > > some > > > > > sort of > > > > > locking or other synchronization. 
If it's clever enough that > > > > > it > > > > > doesn't need that, then it should be able to deal with the > > > > > occasional > > > > > ESTALE error by retrying on its own. > > > > > > > > I tend to agree. FWIW Solaris has a config knob for number of > > > > stale > > > > retries > > > > it does, maybe there is an appetite to have something like that > > > > as > > > > well? > > > > > > > > > > Any reason why we couldn't just return ENOENT in the case where > > > the > > > filehandle is stale? There will have been an unlink() on the > > > symlink at > > > some point in the recent past. > > > > To me ENOENT is preferable to both EIO and ESTALE. > > Another view on that, where in the scenario of `rename` causing the > unlinking, there was no situation of 'no entry' as the directory > entry > was only updated and not removed. So ENOENT in this regard by the > meaning of 'no entry' would not reflect what has really happened. > > (unless you go with the 'no entity' interpretation of ENOENT, but > that > would be against most of the POSIX-spec cases where ENOENT is > returned > which deal primarily with missing path components i.e. names to > objects and not the objects themselves) > The Linux NFS client doesn't support volatile filehandles.
Hey Trond, >> filehandle is stale? There will have been an unlink() on the symlink at >> some point in the recent past. >> > > No reason that I can see. However given that this was observed in the > wild, and essentially > a common pattern with symlinks (overwrite a config file for example), > I think its reasonable > to have the vfs at least do a single retry, by simply returning ESTALE. > However NFS cannot distinguish between first and second retries > afaict... Perhaps the > vfs can help with a ESTALE->ENOENT conversion? So what do you suggest we do here? IMO at a minimum NFS should retry once, similar to nfs4_file_open (it would probably address 99.9% of the use-cases where symlinks are not overwritten at a high enough frequency for the client to see 2 consecutive stale readlink RPC replies). I can send a patch paired with a vfs ESTALE conversion patch, or alternatively retry locally in NFS... I would like to understand your position here.
On Wed, 2024-05-22 at 22:40 +0300, Sagi Grimberg wrote:
> Hey Trond,
>
> > > filehandle is stale? There will have been an unlink() on the
> > > symlink at some point in the recent past.
> > >
> >
> > No reason that I can see. However given that this was observed in
> > the wild, and essentially a common pattern with symlinks (overwrite
> > a config file for example), I think it's reasonable to have the vfs
> > at least do a single retry, by simply returning ESTALE.
> > However NFS cannot distinguish between first and second retries
> > afaict... Perhaps the vfs can help with an ESTALE->ENOENT
> > conversion?
>
> So what do you suggest we do here? IMO at a minimum NFS should retry
> once, similar to nfs4_file_open (it would probably address 99.9% of
> the use-cases, where symlinks are not overwritten at a high enough
> frequency for the client to see 2 consecutive stale readlink RPC
> replies).
>
> I can send a patch paired with a vfs ESTALE conversion patch?
> Alternatively, retry locally in NFS...
> I would like to understand your position here.
>

Looking more closely at nfs_get_link(), it is obvious that it can
already return ESTALE (thanks to the call to nfs_revalidate_mapping())
and looking at do_readlinkat(), it has already been plumbed through
with a call to retry_estale().

So I think we can take your patch as is, since it doesn't add any error
cases that callers of readlink() don't have to handle already.

We might still want to think about cleaning up the output of the VFS in
all these cases, so that we don't return ESTALE when it isn't allowed
by POSIX, but that would be a separate task.
>> So what do you suggest we do here? IMO at a minimum NFS should retry
>> once, similar to nfs4_file_open (it would probably address 99.9% of
>> the use-cases, where symlinks are not overwritten at a high enough
>> frequency for the client to see 2 consecutive stale readlink RPC
>> replies).
>>
>> I can send a patch paired with a vfs ESTALE conversion patch?
>> Alternatively, retry locally in NFS...
>> I would like to understand your position here.
>>
> Looking more closely at nfs_get_link(), it is obvious that it can
> already return ESTALE (thanks to the call to nfs_revalidate_mapping())
> and looking at do_readlinkat(), it has already been plumbed through
> with a call to retry_estale().
>
> So I think we can take your patch as is, since it doesn't add any
> error cases that callers of readlink() don't have to handle already.

Sounds good.

>
> We might still want to think about cleaning up the output of the VFS
> in all these cases, so that we don't return ESTALE when it isn't
> allowed by POSIX, but that would be a separate task.
>

Yes, I can follow up on that later...
On Wed, 2024-05-22 at 21:04 +0000, Trond Myklebust wrote:
> On Wed, 2024-05-22 at 22:40 +0300, Sagi Grimberg wrote:
> > Hey Trond,
> >
> > > > filehandle is stale? There will have been an unlink() on the
> > > > symlink at some point in the recent past.
> > > >
> > >
> > > No reason that I can see. However given that this was observed in
> > > the wild, and essentially a common pattern with symlinks
> > > (overwrite a config file for example), I think it's reasonable to
> > > have the vfs at least do a single retry, by simply returning
> > > ESTALE.
> > > However NFS cannot distinguish between first and second retries
> > > afaict... Perhaps the vfs can help with an ESTALE->ENOENT
> > > conversion?
> >
> > So what do you suggest we do here? IMO at a minimum NFS should retry
> > once, similar to nfs4_file_open (it would probably address 99.9% of
> > the use-cases, where symlinks are not overwritten at a high enough
> > frequency for the client to see 2 consecutive stale readlink RPC
> > replies).
> >
> > I can send a patch paired with a vfs ESTALE conversion patch?
> > Alternatively, retry locally in NFS...
> > I would like to understand your position here.
> >
>
> Looking more closely at nfs_get_link(), it is obvious that it can
> already return ESTALE (thanks to the call to nfs_revalidate_mapping())
> and looking at do_readlinkat(), it has already been plumbed through
> with a call to retry_estale().
>
> So I think we can take your patch as is, since it doesn't add any
> error cases that callers of readlink() don't have to handle already.
>
> We might still want to think about cleaning up the output of the VFS
> in all these cases, so that we don't return ESTALE when it isn't
> allowed by POSIX, but that would be a separate task.
>

I think we can effectively turn ESTALE into ENOENT in most (all?)
syscalls that take a pathname, since you can argue that it would have
been an ENOENT had we gotten in there just a little later.
To fix this the right way, I think you'd have to plumb this translation into most path-based syscall handlers at a fairly high level. Maybe we need some sort of generic sanitize_errno() handler that we call from these sorts of calls? In any case, I think that's a somewhat larger project. :)
On 23/05/2024 0:19, Sagi Grimberg wrote:
>
>>> So what do you suggest we do here? IMO at a minimum NFS should retry
>>> once, similar to nfs4_file_open (it would probably address 99.9% of
>>> the use-cases, where symlinks are not overwritten at a high enough
>>> frequency for the client to see 2 consecutive stale readlink RPC
>>> replies).
>>>
>>> I can send a patch paired with a vfs ESTALE conversion patch?
>>> Alternatively, retry locally in NFS...
>>> I would like to understand your position here.
>>>
>> Looking more closely at nfs_get_link(), it is obvious that it can
>> already return ESTALE (thanks to the call to
>> nfs_revalidate_mapping()) and looking at do_readlinkat(), it has
>> already been plumbed through with a call to retry_estale().
>>
>> So I think we can take your patch as is, since it doesn't add any
>> error cases that callers of readlink() don't have to handle already.
>
> Sounds good.
>
>>
>> We might still want to think about cleaning up the output of the VFS
>> in all these cases, so that we don't return ESTALE when it isn't
>> allowed by POSIX, but that would be a separate task.
>>
>
> Yes, I can follow up on that later...
>

Hey Trond,
is there anything else you are expecting to see before this is taken to
your tree?
On Thu, 2024-05-30 at 21:08 +0300, Sagi Grimberg wrote:
>
> On 23/05/2024 0:19, Sagi Grimberg wrote:
> >
> > > > So what do you suggest we do here? IMO at a minimum NFS should
> > > > retry once, similar to nfs4_file_open (it would probably address
> > > > 99.9% of the use-cases, where symlinks are not overwritten at a
> > > > high enough frequency for the client to see 2 consecutive stale
> > > > readlink RPC replies).
> > > >
> > > > I can send a patch paired with a vfs ESTALE conversion patch?
> > > > Alternatively, retry locally in NFS...
> > > > I would like to understand your position here.
> > > >
> > > Looking more closely at nfs_get_link(), it is obvious that it can
> > > already return ESTALE (thanks to the call to
> > > nfs_revalidate_mapping()) and looking at do_readlinkat(), it has
> > > already been plumbed through with a call to retry_estale().
> > >
> > > So I think we can take your patch as is, since it doesn't add any
> > > error cases that callers of readlink() don't have to handle
> > > already.
> >
> > Sounds good.
> >
> > >
> > > We might still want to think about cleaning up the output of the
> > > VFS in all these cases, so that we don't return ESTALE when it
> > > isn't allowed by POSIX, but that would be a separate task.
> > >
> >
> > Yes, I can follow up on that later...
> >
>
> Hey Trond,
> is there anything else you are expecting to see before this is taken
> to your tree?
>

It's already queued in my testing branch:
https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing

I'll probably push that into the linux-next branch over the weekend.
>> Hey Trond,
>> is there anything else you are expecting to see before this is taken
>> to your tree?
>>
> It's already queued in my testing branch:
> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=shortlog;h=refs/heads/testing
> I'll probably push that into the linux-next branch over the weekend.
>

Ah, I was looking at your linux-next. Thanks
diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
index 0e27a2e4e68b..13818129d268 100644
--- a/fs/nfs/symlink.c
+++ b/fs/nfs/symlink.c
@@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio)
 error:
 	folio_set_error(folio);
 	folio_unlock(folio);
-	return -EIO;
+	return error;
 }
 
 static const char *nfs_get_link(struct dentry *dentry,
There is an inherent race where a symlink file may have been overridden
(by a different client) between lookup and readlink, resulting in a
spurious EIO error returned to userspace. Fix this by propagating back
ESTALE errors, such that the vfs will retry the lookup/get_link
(similar to nfs4_file_open) at least once.

Cc: Dan Aloni <dan.aloni@vastdata.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Note that with this change the vfs should retry once for ESTALE errors.
However, with an artificial reproducer of high-frequency symlink
overrides, nothing prevents the retry from also encountering ESTALE,
propagating the error back to userspace. The man pages for
openat/readlinkat do not list an ESTALE errno.

An alternative attempt (implemented by Dan) was a local retry loop in
nfs_get_link(); if this is an applicable approach, Dan can share his
patch instead.

 fs/nfs/symlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)