NFSDv4: use export cache flushtime for changeid on V4ROOT objects.

Message ID 87mve9rs0z.fsf@notabene.neil.brown.name (mailing list archive)
State New, archived

Commit Message

NeilBrown Jan. 30, 2017, 6:17 a.m. UTC
If you change the set of filesystems that are exported, then
the contents of various directories in the NFSv4 pseudo-root
are likely to change.  However, the change-id of those
directories is currently tied to the underlying directory,
so the client may not see the changes in a timely fashion.

This patch changes the change-id number to be derived from the
"flush_time" of the export cache.  Whenever any changes are
made to the set of exported filesystems, this flush_time is
updated.  The result is that clients see changes to the set
of exported filesystems much more quickly, often immediately.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/nfsd/nfs4xdr.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

J. Bruce Fields Jan. 30, 2017, 3:35 p.m. UTC | #1
On Mon, Jan 30, 2017 at 05:17:00PM +1100, NeilBrown wrote:
> 
> If you change the set of filesystems that are exported, then
> the contents of various directories in the NFSv4 pseudo-root
> are likely to change.  However, the change-id of those
> directories is currently tied to the underlying directory,
> so the client may not see the changes in a timely fashion.

Oh, good catch.

> This patch changes the change-id number to be derived from the
> "flush_time" of the export cache.  Whenever any changes are
> made to the set of exported filesystems, this flush_time is
> updated.  The result is that clients see changes to the set
> of exported filesystems much more quickly, often immediately.

And, a clever solution, as usual....

I wonder if it's completely right yet, though.  Off the top of my head:
can't the client see the new flush time before it sees the new contents?
If so, a client that caches both during that window could cache the old
contents indefinitely.

--b.

> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/nfsd/nfs4xdr.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 8fae53ce21d1..dbff0122b784 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1966,9 +1966,13 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>  	DECODE_TAIL;
>  }
>  
> -static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode)
> +static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> +			     struct svc_export *exp)
>  {
> -	if (IS_I_VERSION(inode)) {
> +	if (exp->ex_flags & NFSEXP_V4ROOT) {
> +		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> +		*p++ = 0;
> +	} else if (IS_I_VERSION(inode)) {
>  		p = xdr_encode_hyper(p, inode->i_version);
>  	} else {
>  		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> @@ -2490,7 +2494,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		p = xdr_reserve_space(xdr, 8);
>  		if (!p)
>  			goto out_resource;
> -		p = encode_change(p, &stat, d_inode(dentry));
> +		p = encode_change(p, &stat, d_inode(dentry), exp);
>  	}
>  	if (bmval0 & FATTR4_WORD0_SIZE) {
>  		p = xdr_reserve_space(xdr, 8);
> -- 
> 2.11.0
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown Jan. 30, 2017, 10:28 p.m. UTC | #2
On Mon, Jan 30 2017, J. Bruce Fields wrote:

> On Mon, Jan 30, 2017 at 05:17:00PM +1100, NeilBrown wrote:
>> 
>> If you change the set of filesystems that are exported, then
>> the contents of various directories in the NFSv4 pseudo-root
>> are likely to change.  However, the change-id of those
>> directories is currently tied to the underlying directory,
>> so the client may not see the changes in a timely fashion.
>
> Oh, good catch.
>
>> This patch changes the change-id number to be derived from the
>> "flush_time" of the export cache.  Whenever any changes are
>> made to the set of exported filesystems, this flush_time is
>> updated.  The result is that clients see changes to the set
>> of exported filesystems much more quickly, often immediately.
>
> And, a clever solution, as usual....
>
> I wonder if it's completely right yet, though.  Off the top of my head:
> can't the client see the new flush time before it sees the new contents?
> If so, a client that caches both during that window could cache the old
> contents indefinitely.

uhm....
Yes, it could see the new flush time before it sees the new contents.
When it sees that new flush time (i.e. new change attribute), it will
invalidate its cached contents and ask for the contents again.  It will
then definitely get new contents.

Or do you see something that I don't?

Thanks,
NeilBrown
J. Bruce Fields Jan. 31, 2017, 2:38 p.m. UTC | #3
On Tue, Jan 31, 2017 at 09:28:37AM +1100, NeilBrown wrote:
> On Mon, Jan 30 2017, J. Bruce Fields wrote:
> 
> > On Mon, Jan 30, 2017 at 05:17:00PM +1100, NeilBrown wrote:
> >> 
> >> If you change the set of filesystems that are exported, then
> >> the contents of various directories in the NFSv4 pseudo-root
> >> are likely to change.  However, the change-id of those
> >> directories is currently tied to the underlying directory,
> >> so the client may not see the changes in a timely fashion.
> >
> > Oh, good catch.
> >
> >> This patch changes the change-id number to be derived from the
> >> "flush_time" of the export cache.  Whenever any changes are
> >> made to the set of exported filesystems, this flush_time is
> >> updated.  The result is that clients see changes to the set
> >> of exported filesystems much more quickly, often immediately.
> >
> > And, a clever solution, as usual....
> >
> > I wonder if it's completely right yet, though.  Off the top of my head:
> > can't the client see the new flush time before it sees the new contents?
> > If so, a client that caches both during that window could cache the old
> > contents indefinitely.
> 
> uhm....
> Yes, it could see the new flush time before it sees the new contents.
> When it sees that new flush time (i.e. new change attribute), it will
> invalidate its cached contents and ask for the contents again.

The problem comes if it's still possible for the client to read (and
cache) the old contents at this point, in which case the client's cache
will incorrectly associate old contents with new change attribute.

> It will then definitely get new contents.

So the problem with changing change attribute before contents is:

	- client retrieves old contents and new attribute, caches.
	- client revalidates cache at an arbitrarily later time, sees
	  it's still the new attribute, continues caching old contents.

So usually I believe you want the two changes--contents and change
attribute--to be atomic or, if that's not possible, for them to be
changed in that order.
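
The ordering argument above can be illustrated with a toy model.  This is
only a sketch under stated assumptions: struct server, struct client_cache
and the helper names are hypothetical, and none of it is nfsd or NFS client
code.  It shows that a client reading in the window between the two updates
ends up permanently stale only when the change attribute moves first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical server state: directory contents plus its change attribute. */
struct server {
	uint64_t change;
	int contents;
};

/* Hypothetical client cache: contents tagged with the attribute seen. */
struct client_cache {
	bool valid;
	uint64_t change;
	int contents;
};

/* Fetch contents and change attribute from the server and cache both. */
static void client_fill(struct client_cache *c, const struct server *s)
{
	c->contents = s->contents;
	c->change = s->change;
	c->valid = true;
}

/* Keep the cache while the change attribute matches, else refetch. */
static void client_revalidate(struct client_cache *c, const struct server *s)
{
	if (!c->valid || c->change != s->change)
		client_fill(c, s);
}

/*
 * Run one update with the client reading in the middle of it, then
 * revalidate.  Returns true if the client is still holding old contents.
 */
static bool stale_after_update(bool attr_first)
{
	struct server s = { .change = 1, .contents = 100 };
	struct client_cache c = { 0 };

	if (attr_first) {
		s.change = 2;        /* attribute bumped first ...           */
		client_fill(&c, &s); /* window: old contents, new attribute  */
		s.contents = 200;    /* ... contents change afterwards       */
	} else {
		s.contents = 200;    /* contents change first ...            */
		client_fill(&c, &s); /* window: new contents, old attribute  */
		s.change = 2;        /* ... attribute bumped afterwards      */
	}

	client_revalidate(&c, &s);
	return c.contents != s.contents;
}
```

In this model stale_after_update(true) reports a stale cache, while
stale_after_update(false) does not: the worst case in the second ordering
is one unnecessary refetch.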

I haven't thought through how that applies to this case, but I think it
should be possible if in-progress rpc's hold references to objects in
the flushed cache?

--b.
NeilBrown Feb. 6, 2017, 9:07 p.m. UTC | #4
On Tue, Jan 31 2017, J. Bruce Fields wrote:

> On Tue, Jan 31, 2017 at 09:28:37AM +1100, NeilBrown wrote:
>> On Mon, Jan 30 2017, J. Bruce Fields wrote:
>> 
>> > On Mon, Jan 30, 2017 at 05:17:00PM +1100, NeilBrown wrote:
>> >> 
>> >> If you change the set of filesystems that are exported, then
>> >> the contents of various directories in the NFSv4 pseudo-root
>> >> are likely to change.  However, the change-id of those
>> >> directories is currently tied to the underlying directory,
>> >> so the client may not see the changes in a timely fashion.
>> >
>> > Oh, good catch.
>> >
>> >> This patch changes the change-id number to be derived from the
>> >> "flush_time" of the export cache.  Whenever any changes are
>> >> made to the set of exported filesystems, this flush_time is
>> >> updated.  The result is that clients see changes to the set
>> >> of exported filesystems much more quickly, often immediately.
>> >
>> > And, a clever solution, as usual....
>> >
>> > I wonder if it's completely right yet, though.  Off the top of my head:
>> > can't the client see the new flush time before it sees the new contents?
>> > If so, a client that caches both during that window could cache the old
>> > contents indefinitely.
>> 
>> uhm....
>> Yes, it could see the new flush time before it sees the new contents.
>> When it sees that new flush time (i.e. new change attribute), it will
>> invalidate its cached contents and ask for the contents again.
>
> The problem comes if it's still possible for the client to read (and
> cache) the old contents at this point, in which case the client's cache
> will incorrectly associate old contents with new change attribute.

I agree with this.

>
>> It will then definitely get new contents.
>
> So the problem with changing change attribute before contents is:
>
> 	- client retrieves old contents and new attribute, caches.
> 	- client revalidates cache at an arbitrarily later time, sees
> 	  it's still the new attribute, continues caching old contents.
>
> So usually I believe you want the two changes--contents and change
> attribute--to be atomic or, if that's not possible, for them to be
> changed in that order.

I believe that setting ->flush_time atomically effects both changes.

>
> I haven't thought through how that applies to this case, but I think it
> should be possible if in-progress rpc's hold references to objects in
> the flushed cache?

How would it do that?
In NFSv4 'READDIR' and 'GETATTR' are separate operations.
If the client sends READDIR and then GETATTR, it must not assume that
the change number in the GETATTR reply implies anything about the
READDIR reply.
But it (presumably) sends them in the other order, so if GETATTR gets a
new change number, then when nfsd4_encode_dirent_fattr() calls
nfsd_crossmnt() it will find the changes to the exports table, though it
may need to wait for an upcall to complete.

You are right to be cautious, but I think ->flush_time effectively
provides the needed atomicity.
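
A rough model of why a single flush_time update can cover both the change
attribute and the contents is sketched below.  It assumes, as a
simplification, that a cached export entry is treated as unusable once
flush_time is at or past the entry's last refresh; the struct and function
names are hypothetical and the upcall is reduced to a plain assignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the export cache and one entry. */
struct export_cache {
	uint64_t flush_time;
};

struct export_entry {
	uint64_t last_refresh;
	int exported_fs;	/* which filesystem this entry currently names */
};

/* Assumed rule: an entry refreshed at or before the last flush is stale. */
static bool entry_is_stale(const struct export_cache *cd,
			   const struct export_entry *e)
{
	return cd->flush_time >= e->last_refresh;
}

/* GETATTR side: the pseudo-root change attribute follows flush_time. */
static uint64_t pseudo_root_change(const struct export_cache *cd)
{
	return cd->flush_time;
}

/*
 * READDIR/LOOKUP side: every use of an entry rechecks it against
 * flush_time and refreshes it first if needed (the modelled "upcall").
 */
static int lookup_export(const struct export_cache *cd,
			 struct export_entry *e,
			 uint64_t now, int current_fs)
{
	if (entry_is_stale(cd, e)) {
		e->exported_fs = current_fs;
		e->last_refresh = now;
	}
	return e->exported_fs;
}
```

Because both functions read the same flush_time, any lookup that runs after
a reply carrying the new change attribute also sees the old entry as stale
and refreshes it before use.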

Thanks,
NeilBrown
J. Bruce Fields Feb. 6, 2017, 10:28 p.m. UTC | #5
On Tue, Feb 07, 2017 at 08:07:13AM +1100, NeilBrown wrote:
> On Tue, Jan 31 2017, J. Bruce Fields wrote:
> 
> > On Tue, Jan 31, 2017 at 09:28:37AM +1100, NeilBrown wrote:
> >> On Mon, Jan 30 2017, J. Bruce Fields wrote:
> >> 
> >> > On Mon, Jan 30, 2017 at 05:17:00PM +1100, NeilBrown wrote:
> >> >> 
> >> >> If you change the set of filesystems that are exported, then
> >> >> the contents of various directories in the NFSv4 pseudo-root
> >> >> are likely to change.  However, the change-id of those
> >> >> directories is currently tied to the underlying directory,
> >> >> so the client may not see the changes in a timely fashion.
> >> >
> >> > Oh, good catch.
> >> >
> >> >> This patch changes the change-id number to be derived from the
> >> >> "flush_time" of the export cache.  Whenever any changes are
> >> >> made to the set of exported filesystems, this flush_time is
> >> >> updated.  The result is that clients see changes to the set
> >> >> of exported filesystems much more quickly, often immediately.
> >> >
> >> > And, a clever solution, as usual....
> >> >
> >> > I wonder if it's completely right yet, though.  Off the top of my head:
> >> > can't the client see the new flush time before it sees the new contents?
> >> > If so, a client that caches both during that window could cache the old
> >> > contents indefinitely.
> >> 
> >> uhm....
> >> Yes, it could see the new flush time before it sees the new contents.
> >> When it sees that new flush time (i.e. new change attribute), it will
> >> invalidate its cached contents and ask for the contents again.
> >
> > The problem comes if it's still possible for the client to read (and
> > cache) the old contents at this point, in which case the client's cache
> > will incorrectly associate old contents with new change attribute.
> 
> I agree with this.
> 
> >
> >> It will then definitely get new contents.
> >
> > So the problem with changing change attribute before contents is:
> >
> > 	- client retrieves old contents and new attribute, caches.
> > 	- client revalidates cache at an arbitrarily later time, sees
> > 	  it's still the new attribute, continues caching old contents.
> >
> > So usually I believe you want the two changes--contents and change
> > attribute--to be atomic or, if that's not possible, for them to be
> > changed in that order.
> 
> I believe that setting ->flush_time atomically effects both changes.
> 
> >
> > I haven't thought through how that applies to this case, but I think it
> > should be possible if in-progress rpc's hold references to objects in
> > the flushed cache?
> 
> How would it do that?
> In NFSv4 'READDIR' and 'GETATTR' are separate operations.
> If the client sends READDIR and then GETATTR, it must not assume that
> the change number in the GETATTR reply implies anything about the
> READDIR reply.
> But it (presumably) sends them in the other order, so if GETATTR gets a
> new change number, then when nfsd4_encode_dirent_fattr() calls
> nfsd_crossmnt() it will find the changes to the exports table, though it
> may need to wait for an upcall to complete.
> 
> You are right to be cautious, but I think ->flush_time effectively
> provides the needed atomicity.

Yeah, I just hadn't thought it through.  So long as the only "content"
we care about is readdir/lookup results, and so long as those always
require nfsd_crossmnt() and a new cache lookup, then I agree this works.
Thanks!

--b.
Patch

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 8fae53ce21d1..dbff0122b784 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1966,9 +1966,13 @@  nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 	DECODE_TAIL;
 }
 
-static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode)
+static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
+			     struct svc_export *exp)
 {
-	if (IS_I_VERSION(inode)) {
+	if (exp->ex_flags & NFSEXP_V4ROOT) {
+		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
+		*p++ = 0;
+	} else if (IS_I_VERSION(inode)) {
 		p = xdr_encode_hyper(p, inode->i_version);
 	} else {
 		*p++ = cpu_to_be32(stat->ctime.tv_sec);
@@ -2490,7 +2494,7 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
 			goto out_resource;
-		p = encode_change(p, &stat, d_inode(dentry));
+		p = encode_change(p, &stat, d_inode(dentry), exp);
 	}
 	if (bmval0 & FATTR4_WORD0_SIZE) {
 		p = xdr_reserve_space(xdr, 8);
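
For readers without a kernel tree at hand, the three-way choice made by the
patched encode_change() can be approximated in userspace as below.  This is
a sketch, not the kernel function: struct change_src and to_be32() are
hypothetical stand-ins for the export, inode and kstat fields and for
cpu_to_be32(), and the second word of the ctime branch (nanoseconds) is an
assumption, since the hunk above does not show that line.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bundle of the inputs the patched function consults. */
struct change_src {
	bool is_v4root;		  /* models exp->ex_flags & NFSEXP_V4ROOT */
	bool has_i_version;	  /* models IS_I_VERSION(inode) */
	uint32_t flush_wallclock; /* models convert_to_wallclock(flush_time) */
	uint64_t i_version;
	uint32_t ctime_sec;
	uint32_t ctime_nsec;	  /* assumed second word of the fallback */
};

/* Portable stand-in for cpu_to_be32(): store v in big-endian byte order. */
static uint32_t to_be32(uint32_t v)
{
	unsigned char b[4] = {
		(unsigned char)(v >> 24), (unsigned char)(v >> 16),
		(unsigned char)(v >> 8), (unsigned char)v
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Write the 8-byte change attribute as two big-endian 32-bit words. */
static uint32_t *encode_change_model(uint32_t *p, const struct change_src *s)
{
	if (s->is_v4root) {
		/* Pseudo-root: flush time in the high word, zero below. */
		*p++ = to_be32(s->flush_wallclock);
		*p++ = 0;
	} else if (s->has_i_version) {
		/* Same layout as a 64-bit XDR hyper: high word first. */
		*p++ = to_be32((uint32_t)(s->i_version >> 32));
		*p++ = to_be32((uint32_t)(s->i_version & 0xffffffffu));
	} else {
		*p++ = to_be32(s->ctime_sec);
		*p++ = to_be32(s->ctime_nsec);
	}
	return p;
}
```

The V4ROOT branch is checked first, so a pseudo-root directory reports the
export-cache flush time even when the underlying inode has a usable
i_version.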