
[10/10,v7] nfsd: Allows user un-mounting filesystem where nfsd exports base on

Message ID 20150713133934.6a4ef77d@noble (mailing list archive)
State New, archived

Commit Message

NeilBrown July 13, 2015, 3:39 a.m. UTC
On Sat, 11 Jul 2015 20:52:56 +0800 Kinglong Mee <kinglongmee@gmail.com>
wrote:

> If there are some mount points (not exported for nfs) under the pseudo root,
> then after a client operates on those entries under the root, *no one* can
> unmount those mount points until the export cache expires.
> 
> /nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
> /nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
> total 0
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
> drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
> Filesystem                      1K-blocks    Used Available Use% Mounted on
> ......
> /dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
> /dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
> /dev/sde                           999320    1284    929224   1% /nfs/test
> /mnt/pnfs/:
> total 0
> -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
> drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
> 
> /mnt/xfs/:
> total 0
> umount: /nfs/test/: target is busy
>         (In some cases useful info about processes that
>         use the device is found by lsof(8) or fuser(1).)
> 
> It's caused by the exports cache of nfsd holding a reference to
> the path (here /nfs/test/), so it can't be unmounted.
> 
> I don't think that's what users expect; they want to umount /nfs/test/.
> Bruce thinks users should also be able to umount /nfs/pnfs/ and /nfs/xfs.
> 
> Also, use kzalloc for all memory allocation instead of kmalloc.
> Thanks to Al Viro for his comments on the logic of fs_pin.
> 
> v3,
> 1. using path_get_pin/path_put_unpin for path pin
> 2. using kzalloc for memory allocating
> 
> v4,
> 1. add a completion for pin_kill to wait until the reference is decreased to zero.
> 2. add a work_struct so that pin_kill decreases the reference indirectly.
> 3. free svc_export/svc_expkey in pin_kill, not in svc_export_put/svc_expkey_put.
> 4. svc_export_put/svc_expkey_put go through the pin_kill logic.
> 
> v5, same as v4.
> 
> v6,
> 1. Pin the vfsmnt to the mount point at first; when the reference increases (==2),
>    grab a reference to the vfsmnt with mntget. When it decreases (==1),
>    drop the reference to the vfsmnt, leaving the pin.
> 2. Delete cache_head directly from cache_detail.
> 
> v7, 
> implement self reference increase and decrease for nfsd exports/expkey 
> 
> When the reference count of a cache_head increases (>1), grab a reference to the mnt once;
> when the reference count decreases to 1 (==1), drop the reference to the mnt.
> 
> So after that,
> when ref > 1, the user cannot umount the filesystem (umount fails with -EBUSY);
> when ref == 1, the cache entry is referenced only by the nfsd cache,
> with no other references, so the user can try to umount:
> 1. before MNT_UMOUNT is set (protected by mount_lock), the nfsd cache is
>    referenced (ref > 1, legitimize_mntget), so umount fails with -EBUSY.
> 2. after MNT_UMOUNT is set, if the nfsd cache is referenced (ref == 2),
>    legitimize_mntget will fail; the cache entry is set to CACHE_NEGATIVE,
>    the reference is dropped, and ref goes back to 1,
>    so pin_kill can delete the cache entry and umount succeeds.
> 3. when unmounting with no references to the nfsd cache,
>    pin_kill can delete the cache entry and umount succeeds.
> 
> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>

Wow.... this is turning out to be a lot more complex than I imagined at
first (isn't that always the way!).

There is a lot of good stuff here, but I think we can probably make it
simpler and so even better.

I particularly don't like the get_ref/put_ref pointers in cache_head.
They make cache_head a lot bigger than it was before, and they are only
needed for two specific caches.  And then they are the same for every element
in the cache.

I also don't like the ref_mutex ... or I don't like where it is used...
or something.  I definitely don't think we need one per cached entry.
Maybe one per cache.

I can certainly see that the "first" time we get a reference to a cache
item that holds a vfsmnt pointer, we need to "legitimize" that - or
fail.  But I don't think that has to happen inside the cache.c
machinery.

How about this:
 - add a new cache flag "CACHE_ACTIVE" (for example) which the cache
   owner can set whenever it likes.  When cache_put finds that CACHE_ACTIVE
   is set when refcount is <= 2, it calls a new cache_detail method: cache_deactivate.
 - cache_deactivate takes a mutex (yes, we do need one, don't we)
   and if CACHE_ACTIVE is still set and refcount is still <= 2,
   it drops the reference on the vfsmnt and clears CACHE_ACTIVE.
   This actually needs to be something like:
    if (test_and_clear_bit(CACHE_ACTIVE,...)) {
        if (atomic_read(..refcnt) > 2) {
             set_bit(CACHE_ACTIVE);
             mutex_unlock()
             return

   so that if other code gets a reference and tests CACHE_ACTIVE, it
   won't suddenly become inactive.  Might need a memory barrier in there...
   no, test_and_clear implies a memory barrier.

We only need to make changes to svc_export and svc_expkey - right?
So that would be:
 Change svc_export_lookup and svc_expkey_lookup so they look something
 like:

  svc_XX_lookup(struct cache_detail *cd, struct svc_XXX *item)
  {
      struct cache_head *ch;
      int hash = svc_XXX_hash(item);

      ch = sunrpc_cache_lookup(cd, &item->h, hash);
      if (!ch)
           return NULL;
      item = container_of(ch, struct svc_XXX, h);
      if (!test_bit(CACHE_VALID, &ch->flags) ||
          test_bit(CACHE_NEGATIVE, &ch->flags) ||
          test_bit(CACHE_ACTIVE, &ch->flags))
            return item;

      mutex_lock(&svc_XXX_mutex);
      if (!test_bit(CACHE_ACTIVE, &ch->flags)) {
              if (legitimize_mntget() == NULL) {
                      XXX_put(item);
                      item = NULL;
              } else
                      set_bit(CACHE_ACTIVE, &ch->flags);
      }
      mutex_unlock(&svc_XXX_mutex);
      return item;
 }

Then the new 'cache_deactivate' function is something like:

  svc_XXX_deactivate(struct cache_detail *cd, struct cache_head *ch)
  {
        struct svc_XXX *item = container_of(ch, struct svc_XXX, h);

       mutex_lock(&svc_XXX_mutex);
       if (test_and_clear_bit(CACHE_ACTIVE, &ch->flags)) {
              if (atomic_read(&ch->ref.refcount) > 2) {
                   /* Race with get_ref - do nothing */
                   set_bit(CACHE_ACTIVE, &ch->flags);
              } else
                   mntput(....mnt);
       }
       mutex_unlock(&svc_XXX_mutex);
  }


cache_put would have:

    if (test_bit(CACHE_ACTIVE, &h->flags) &&
        cd->cache_deactivate &&
        atomic_read(&h->ref.refcount) <= 2)
           cd->cache_deactivate(cd, h);

but there is still a race.  If: (T1 and T2 are threads)
   T1: cache_put finds refcount is 2 and CACHE_ACTIVE is set and calls ->cache_deactivate
   T2: cache_get increments the refcount to 3
   T1: cache_deactivate clears CACHE_ACTIVE and finds refcount is 3
   T2: now calls cache_put, which sees CACHE_ACTIVE is clear so refcount becomes 2
   T1: sets CACHE_ACTIVE again and continues.  refcount becomes 1.

So now refcount is 1 and the item is still active.

We can fix this by making cache_put loop:
    while (test_bit(CACHE_ACTIVE, &h->flags) &&
          cd->cache_deactivate &&
          (smp_rmb(), 1) &&
          atomic_read(&h->ref.refcount) <= 2)
           cd->cache_deactivate(cd, h);

This should ensure that refcount never gets to 1 with the
item still active (i.e. with a ref count on the mnt).


The work item and completion are a bit unfortunate too.

I guess the problem here is that pin_kill() can run while there are
some inactive references to the cache item.  There can then be a race
over who will use path_put_unpin to put the dentry.

Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
on the cache item.
If that succeeds, then path_put_unpin hasn't been called and it won't be.
So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
If it fails, then it has already been called and nothing else need be done.
Almost.
If kref_get_unless_zero() fails, pin_remove() may not have been called
yet, but it will be soon.  We might need to wait.
It would be nice if pin_kill() would check ->done again after calling p->kill.
e.g.

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b5408f6ec..c2ef5c9d4c0d 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
 		spin_unlock_irq(&p->wait.lock);
 		rcu_read_unlock();
 		p->kill(p);
-		return;
+		if (p->done > 0)
+			return;
+		spin_lock_irq(&p->wait.lock);
 	}
 	if (p->done > 0) {
 		spin_unlock_irq(&p->wait.lock);

I think that would close the last gap, without needing extra work
items and completion in the nfsd code.

Al: would you be OK with that change to pin_kill?

Thanks,
NeilBrown

Comments

Al Viro July 13, 2015, 4:02 a.m. UTC | #1
On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?

Hell, no.  Intended use is to have ->kill() free the damn thing, period.
This code is very careful about how it waits in the "found it before
->kill() has removed all pointers leading to that object" case.  No go.
This change would break all existing users, with arseloads of extra
complications for no good reason whatsoever.

And frankly, I'm still not convinced that fs_pin is a good match for the
problem here - I'm not saying it's impossible to produce an fs_pin-based
solution (and I hadn't reviewed the latest iteration yet), but attempts so
far didn't look particularly promising.
NeilBrown July 13, 2015, 4:20 a.m. UTC | #2
On Mon, 13 Jul 2015 13:39:34 +1000 NeilBrown <neilb@suse.com> wrote:


> The work item and completion are a bit unfortunate too.
> 
> I guess the problem here is that pin_kill() can run while there are
> some inactive references to the cache item.  There can then be a race
> over who will use path_put_unpin to put the dentry.
> 
> Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
> on the cache item.
> If that succeeds, then path_put_unpin hasn't been called and it won't be.
> So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
> If it fails, then it has already been called and nothing else need be done.
> Almost.
> If kref_get_unless_zero() fails, pin_remove() may not have been called
> yet, but it will be soon.  We might need to wait.
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?

Actually, with that change to pin_kill, this side of things becomes
really easy.
All expXXX_pin_kill needs to do is call your new cache_delete_entry.
If that doesn't cause the entry to be put, then something else has a
temporary reference which will be put soon.  In any case, pin_kill()
will wait long enough, but not indefinitely.
No need for kref_get_unless_zero() or any of that.
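
For concreteness, with that pin_kill() change the expkey side would reduce to
something like this (only a sketch - cache_delete_entry() is your new helper,
and the cd/ek_pin field names are assumptions from the current patches):

  static void expkey_pin_kill(struct fs_pin *pin)
  {
       struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin);

       /* Unhash the entry so no new references can be found.  The final
        * put will do path_put_unpin() -> pin_remove(), and the pin_kill()
        * that called us waits for exactly that. */
       cache_delete_entry(key->cd, &key->h);
  }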

NeilBrown
Al Viro July 13, 2015, 4:45 a.m. UTC | #3
On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:

> Actually, with that change to pin_kill, this side of things becomes
> really easy.
> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> If that doesn't cause the entry to be put, then something else has a
> temporary reference which will be put soon.  In any case, pin_kill()
> will wait long enough, but not indefinitely.
> No need for kref_get_unless_zero() or any of that.

No.  You are seriously misunderstanding what ->kill() is for and what the
existing instances are doing.  Again, there is no promise whatsoever that
the object containing fs_pin instance will *survive* past ->kill().
At all.

RTFS, please.  What is sorely missing in this recurring patchset is a clear
description of lifetime rules and ordering (who waits for whom and how long).
For all the objects involved.
NeilBrown July 13, 2015, 5:19 a.m. UTC | #4
On Mon, 13 Jul 2015 05:02:58 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> > It would be nice if pin_kill() would check ->done again after calling p->kill.
> > e.g.
> > 
> > diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> > index 611b5408f6ec..c2ef5c9d4c0d 100644
> > --- a/fs/fs_pin.c
> > +++ b/fs/fs_pin.c
> > @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
> >  		spin_unlock_irq(&p->wait.lock);
> >  		rcu_read_unlock();
> >  		p->kill(p);
> > -		return;
> > +		if (p->done > 0)
> > +			return;
> > +		spin_lock_irq(&p->wait.lock);
> >  	}
> >  	if (p->done > 0) {
> >  		spin_unlock_irq(&p->wait.lock);
> > 
> > I think that would close the last gap, without needing extra work
> > items and completion in the nfsd code.
> > 
> > Al: would you be OK with that change to pin_kill?
> 
> Hell, no.  Intended use is to have ->kill() free the damn thing, period.

Is it not possible to revise that intention?
The apparent purpose of pin_kill() is to wait until the thing is freed,
or to trigger that freeing itself.  Why not do both: trigger then wait?


> This code is very careful about how it waits in the "found it before
> ->kill() has removed all pointers leading to that object" case.  No go.
> This change would break all existing users, with arseloads of extra
> complications for no good reason whatsoever.

Given that all current uses have ->kill() call pin_remove, and as
pin_remove sets ->done to 1, and as the patch makes no change to
behaviour when ->kill() completes with ->done set to 1, I don't see how
it can break anything.
'rcu' ensures that it is still safe to examine p->done, and it will be
'1'.

Thanks,
NeilBrown


> 
> And frankly, I'm still not convinced that fs_pin is a good match for the
> problem here - I'm not saying it's impossible to produce an fs_pin-based
> solution (and I hadn't reviewed the latest iteration yet), but attempts so
> far didn't look particularly promising.

NeilBrown July 13, 2015, 5:21 a.m. UTC | #5
On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> 
> > Actually, with that change to pin_kill, this side of things becomes
> > really easy.
> > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > If that doesn't cause the entry to be put, then something else has a
> > temporary reference which will be put soon.  In any case, pin_kill()
> > will wait long enough, but not indefinitely.
> > No need for kref_get_unless_zero() or any of that.
> 
> No.  You are seriously misunderstanding what ->kill() is for and what the
> existing instances are doing.  Again, there is no promise whatsoever that
> the object containing fs_pin instance will *survive* past ->kill().
> At all.

Ah... I missed that rcu_read_unlock happened before ->kill.  Sorry
about that.

It still seems like the waiting that pin_kill does is exactly what we
need.

I'll think about it some more.

Thanks,
NeilBrown


> 
> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> description of lifetime rules and ordering (who waits for whom and how long).
> For all the objects involved.

Al Viro July 13, 2015, 6:02 a.m. UTC | #6
On Mon, Jul 13, 2015 at 03:19:10PM +1000, NeilBrown wrote:
> > Hell, no.  Intended use is to have ->kill() free the damn thing, period.
> 
> Is it not possible to revise that intention?
> The apparent purpose of pin_kill() is to wait until the thing is freed,
> or to trigger that freeing itself.  Why not do both: trigger then wait?

Huh?  The first to come calls ->kill(); anybody coming between that and
pin_remove() done by ->kill() waits until pin_remove() is called and
buggers off.

> Given that all current uses have ->kill() call pin_remove, and as
> pin_remove sets ->done to 1, and as the patch makes no change to
> behaviour when ->kill() completes with ->done set to 1, I don't see how
> it can break anything.
> 'rcu' ensures that it is still safe to examine p->done, and it will be
> '1'.

RCU ensures no such thing.  Where's the rcu_read_lock() opening the
RCU-critical area you need?
NeilBrown July 13, 2015, 6:02 a.m. UTC | #7
On Mon, 13 Jul 2015 15:21:33 +1000 NeilBrown <neilb@suse.com> wrote:

> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
> wrote:
> 
> > On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> > 
> > > Actually, with that change to pin_kill, this side of things becomes
> > > really easy.
> > > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > > If that doesn't cause the entry to be put, then something else has a
> > > temporary reference which will be put soon.  In any case, pin_kill()
> > > will wait long enough, but not indefinitely.
> > > No need for kref_get_unless_zero() or any of that.
> > 
> > No.  You are seriously misunderstanding what ->kill() is for and what the
> > existing instances are doing.  Again, there is no promise whatsoever that
> > the object containing fs_pin instance will *survive* past ->kill().
> > At all.
> 
> Ah... I missed that rcu_read_unlock happened before ->kill.  Sorry
> about that.
> 
> It still seems like the waiting that pin_kill does is exactly what we
> need.
> 
> I'll think about it some more.
> 

Ok....

A key issue is that the code currently assumes that the only way a
'pin' is removed is by the pinned thing calling pin_kill().

The problem is that we want the pinning thing to be able to remove
itself.

I think that means we need a variant of pin_remove() which reports if
pin->done was 0 or -1.
If it was 0, then ->kill hasn't been called, and it won't be.  So the
caller is free to clean up how it likes (providing RCU is used for
freeing).
If it was -1, then ->kill has been called and is expected to clean up -
the caller should assume that has already happened.


So path_put_unpin() needs to call pin_remove_and_test() (or whatever it
is called) and return the value.
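
Just to be concrete, based on the current pin_remove() in fs/fs_pin.c that
variant might look something like this (a sketch; the name is a placeholder):

   int pin_remove_and_test(struct fs_pin *pin)
   {
        int done;

        spin_lock(&pin_lock);
        hlist_del_init(&pin->m_list);
        hlist_del_init(&pin->s_list);
        spin_unlock(&pin_lock);
        spin_lock_irq(&pin->wait.lock);
        done = pin->done;    /* 0: ->kill() never ran;  -1: ->kill() is running */
        pin->done = 1;
        wake_up_locked(&pin->wait);
        spin_unlock_irq(&pin->wait.lock);
        return done;
   }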

Then expXXX_put() calls path_put_unpin and only calls kfree_rcu() if
->done was previously 0.

expXXX_pin_kill() calls cache_delete_entry(), and then calls pin_kill().
This recursive call to pin_kill() will wait until expXXX_put() has
called pin_remove_and_test() and then returns.
At this point there are no references to the cache entry except the one
that expXXX_pin_kill() holds.  So it can call kfree_rcu().


Would that work?
Are you happy with pin_remove() returning a status?

Thanks,
NeilBrown
Al Viro July 13, 2015, 6:08 a.m. UTC | #8
On Mon, Jul 13, 2015 at 04:02:43PM +1000, NeilBrown wrote:

> I think that means we need a variant of pin_remove() which reports if
> pin->done was 0 or -1.
> If it was 0, then ->kill hasn't been called, and it won't be.  So the
> caller is free to clean up how it likes (providing RCU is used for
> freeing).

Grr...  What will umount(2) wait for if it comes during your cleanup?
You are putting them in the wrong order - pin_remove() is "I'm *DONE*
killing that sucker, nothing to wait for", not "move along, don't wait
for me, I've taken over killing it off".
NeilBrown July 13, 2015, 6:32 a.m. UTC | #9
On Mon, 13 Jul 2015 07:08:02 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 04:02:43PM +1000, NeilBrown wrote:
> 
> > I think that means we need a variant of pin_remove() which reports if
> > pin->done was 0 or -1.
> > If it was 0, then ->kill hasn't been called, and it won't be.  So the
> > caller is free to clean up how it likes (providing RCU is used for
> > freeing).
> 
> Grr...  What will umount(2) wait for if it comes during your cleanup?
> You are putting them in the wrong order - pin_remove() is "I'm *DONE*
> killing that sucker, nothing to wait for", not "move along, don't wait
> for me, I've taken over killing it off".

pin_remove() disconnects the pinning thing (sunrpc cache entry in this
case) from the pinned thing (vfsmnt in this case).
After it has run the pinned thing can do whatever it likes without any
reference to the pinning thing, and the pinning thing just needs to wait
an RCU grace period, and then can do whatever it likes.

The "cleanup" is, in this case, just a call to rcu_kfree().  There is
no need for umount(2) to wait for it.


Certainly any state that the pinning structure has that relates to the
pinned structure must be cleaned up before calling pin_remove, so for
example dput() must be called on path.dentry *before* pin_remove is
called on path.mnt.  But other parts of the pinning structure can be
handled as its owner chooses.
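
In this patch set that ordering is hidden inside path_put_unpin(), which I
would expect to be no more than (sketch, going by the description above):

   void path_put_unpin(struct path *path, struct fs_pin *p)
   {
        dput(path->dentry);     /* clean up state tied to the pinned mount... */
        pin_remove(p);          /* ...then detach from the vfsmount */
   }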

NeilBrown
Al Viro July 13, 2015, 6:43 a.m. UTC | #10
On Mon, Jul 13, 2015 at 04:32:01PM +1000, NeilBrown wrote:
> pin_remove() disconnects the pinning thing (sunrpc cache entry in this
> case) from the pinned thing (vfsmnt in this case).
> After it has run the pinned thing can do whatever it likes without any
> reference to the pinning thing, and the pinning thing just needs to wait
> an RCU grace period, and then can do whatever it likes.
> 
> The "cleanup" is, in this case, just a call to rcu_kfree().  There is
> no need for umount(2) to wait for it.
> 
> 
> Certainly any state that the pinning structure has that relates to the
> pinned structure must be cleaned up before calling pin_remove, so for
> example dput() must be called on path.dentry *before* pin_remove is
> called on path.mnt.  But other parts of the pinning structure can be
> handled as its owner chooses.

Then what's the difference between that and what's getting done in ->kill()
triggered by cleanup_mnt()?

In any case, you have two possibilities - explicit unexport triggering that
dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
to wait until it's done.  So why not make the point where we commit to
unexporting the sucker the place where we do pin_kill()?  And have ->kill()
of that thing prevent appearance of new entries, then wait for them to run
down.  Which is precisely the same thing you want to happen on umount...
NeilBrown July 15, 2015, 3:49 a.m. UTC | #11
On Mon, 13 Jul 2015 07:43:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 04:32:01PM +1000, NeilBrown wrote:
> > pin_remove() disconnects the pinning thing (sunrpc cache entry in this
> > case) from the pinned thing (vfsmnt in this case).
> > After it has run the pinned thing can do whatever it likes without any
> > reference to the pinning thing, and the pinning thing just needs to wait
> > an RCU grace period, and then can do whatever it likes.
> > 
> > The "cleanup" is, in this case, just a call to rcu_kfree().  There is
> > no need for umount(2) to wait for it.
> > 
> > 
> > Certainly any state that the pinning structure has that relates to the
> > pinned structure must be cleaned up before calling pin_remove, so for
> > example dput() must be called on path.dentry *before* pin_remove is
> > called on path.mnt.  But other parts of the pinning structure can be
> > handled as its owner chooses.
> 
> Then what's the difference between that and what's getting done in ->kill()
> triggered by cleanup_mnt()?

Uhm... probably nothing.  I'm not sure what you are getting at.
I just need to do it at a different time to cleanup_mnt(), but also to
be aware that doing it might race with cleanup_mnt().


> 
> In any case, you have two possibilities - explicit unexport triggering that
> dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> to wait until it's done.  So why not make the point where we commit to
> unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> of that thing prevent appearance of new entries, then wait for them to run
> down.  Which is precisely the same thing you want to happen on umount...

The "wait for them to run down" part is the sticking point.  We don't
have any easy way to wait for there to be no more references, so I'd
really like to use the waiting that pin_kill() already does.

I want the ->kill function to just unhash the cache entry, and then
wait for pin_delete() to be called.

The final 'put' on the cache entry calls dput on the dentry and then
pin_remove().

The ->kill function can wait for that to happen by calling pin_kill().

I guess there is no real need for a return value from pin_remove().

So
  static void expkey_pin_kill(struct fs_pin *pin)
  {
     struct svc_expkey *key = container_of(pin, ....);

     cache_delete_entry(key->cd, &key->h);
     pin_kill(&key->ek_pin); /* recursive call will wait for
                              * pin_delete() to be called */
  }

and static void expkey_put(struct kref *ref)
    {
        struct svc_expkey *key = container_of(ref, ....); 

        auth_domain_put(key->ek_client);
        if (test_bit(CACHE_VALID, &key->h.flags) &&
            !test_bit(CACHE_NEGATIVE, &key->h.flags))
               path_put_unpin(&key->ek_path, &key->ek_pin);
        kfree_rcu(key, rcu_head);
    }

We ensure that no new references are taken by svc_expkey_lookup()
calling legitimize_mntget() and returning NULL if that fails.
It should probably call cache_delete_entry() when that happens just to
be on the safe side.
cache_delete_entry() must check if the object is still in the hash
table before deleting it.
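
i.e. something along these lines (sketch only - the exact field and lock
names depend on the sunrpc cache patches in this series):

   void cache_delete_entry(struct cache_detail *cd, struct cache_head *h)
   {
        write_lock(&cd->hash_lock);
        if (hlist_unhashed(&h->cache_list)) {
                /* already removed by someone else - nothing to do */
                write_unlock(&cd->hash_lock);
                return;
        }
        hlist_del_init(&h->cache_list);
        cd->entries--;
        write_unlock(&cd->hash_lock);
        cache_put(h, cd);       /* drop the hash table's reference */
   }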

So I think it can work nicely without any changes to the fs_pin code.

Can you see any further problems?

Thanks,
NeilBrown

Al Viro July 15, 2015, 4:57 a.m. UTC | #12
On Wed, Jul 15, 2015 at 01:49:48PM +1000, NeilBrown wrote:

> > In any case, you have two possibilities - explicit unexport triggering that
> > dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> > to wait until it's done.  So why not make the point where we commit to
> > unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> > of that thing prevent appearance of new entries, then wait for them to run
> > down.  Which is precisely the same thing you want to happen on umount...
> 
> The "wait for them to run down" part is the sticking point.  We don't
> have any easy way to wait for there to be no more references, so I'd
> really like to use the waiting that pin_kill() already does.

> I want the ->kill function to just unhash the cache entry, and then
> wait for pin_delete() to be called.

pin_remove, presumably?

> The final 'put' on the cache entry calls dput on the dentry and then
> pin_remove().
> 
> The ->kill function can wait for that to happen by calling pin_kill().

> I guess there is no real need for a return value from pin_remove().
> 
> So
>   static void expkey_pin_kill(struct fs_pin *pin)
>   {
>      struct svc_expkey *key = container_of(pin, ....);
> 
>      cache_delete_entry(key->cd, &key->h);


... and another CPU drops the last reference, freeing the sucker.

>      pin_kill(&key->ek_pin); /* recursive call will wait for
>                               * pin_delete() to be called */

... oopsing on the spot.
NeilBrown July 15, 2015, 6:51 a.m. UTC | #13
On Wed, 15 Jul 2015 05:57:56 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Wed, Jul 15, 2015 at 01:49:48PM +1000, NeilBrown wrote:
> 
> > > In any case, you have two possibilities - explicit unexport triggering that
> > > dput(), etc. and umount(2) triggering the same.  Whoever comes second gets
> > > to wait until it's done.  So why not make the point where we commit to
> > > unexporting the sucker the place where we do pin_kill()?  And have ->kill()
> > > of that thing prevent appearance of new entries, then wait for them to run
> > > down.  Which is precisely the same thing you want to happen on umount...
> > 
> > The "wait for them to run down" part is the sticking point.  We don't
> > have any easy way to wait for there to be no more references, so I'd
> > really like to use the waiting that pin_kill() already does.
> 
> > I want the ->kill function to just unhash the cache entry, and then
> > wait for pin_delete() to be called.
> 
> pin_remove, presumably?

Presumably, yes.

> 
> > The final 'put' on the cache entry calls dput on the dentry and then
> > pin_remove().
> > 
> > The ->kill function can wait for that to happen by calling pin_kill().
> 
> > I guess there is no real need for a return value from pin_remove().
> > 
> > So
> >   static void expkey_pin_kill(struct fs_pin *pin)
> >   {
> >      struct svc_expkey *key = container_of(pin, ....);
> > 
> >      cache_delete_entry(key->cd, &key->h);
> 
> 
> ... and another CPU drops the last reference, freeing the sucker.

*That's* why I wanted pin_remove to return something.

Why didn't you like that option again?

To recap:
pin_remove does:

  tmp = pin->done;
  pin->done = 1;
  ...
  return tmp;

path_put_unpin() returns the return value of pin_remove()

expkey_put() does:
   auth_domain_put(key->ek_client);
   if (test some bits)
       if (path_put_unpin() < 0)
          /* pin_kill has taken over */
          return
   kfree_rcu(key, rcu_head);
}

and expkey_pin_kill() does:

   cache_delete_entry(key->cd, &key->h);
   pin_kill(&key->ek_pin);
   kfree_rcu(key, rcu_head);
}


Thanks,
NeilBrown

J. Bruce Fields July 15, 2015, 9:07 p.m. UTC | #14
On Mon, Jul 13, 2015 at 01:39:34PM +1000, NeilBrown wrote:
> On Sat, 11 Jul 2015 20:52:56 +0800 Kinglong Mee <kinglongmee@gmail.com>
> wrote:
> 
> > If there are some mount points(not exported for nfs) under pseudo root,
> > after client's operation of those entry under the root,  anyone *can't*
> > unmount those mount points until export cache expired.
> > 
> > /nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
> > /nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
> > total 0
> > drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
> > drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
> > drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
> > Filesystem                      1K-blocks    Used Available Use% Mounted on
> > ......
> > /dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
> > /dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
> > /dev/sde                           999320    1284    929224   1% /nfs/test
> > /mnt/pnfs/:
> > total 0
> > -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
> > drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
> > 
> > /mnt/xfs/:
> > total 0
> > umount: /nfs/test/: target is busy
> >         (In some cases useful info about processes that
> >         use the device is found by lsof(8) or fuser(1).)
> > 
> > It's caused by exports cache of nfsd holds the reference of
> > the path (here is /nfs/test/), so, it can't be umounted.
> > 
> > I don't think that's user expect, they want umount /nfs/test/.
> > Bruce think user can also umount /nfs/pnfs/ and /nfs/xfs.
> > 
> > Also, using kzalloc for all memory allocating without kmalloc.
> > Thanks for Al Viro's commets for the logic of fs_pin.
> > 
> > v3,
> > 1. using path_get_pin/path_put_unpin for path pin
> > 2. using kzalloc for memory allocating
> > 
> > v4,
> > 1. add a completion for pin_kill waiting the reference is decreased to zero.
> > 2. add a work_struct for pin_kill decreases the reference indirectly.
> > 3. free svc_export/svc_expkey in pin_kill, not svc_export_put/svc_expkey_put.
> > 4. svc_export_put/svc_expkey_put go though pin_kill logic.
> > 
> > v5, same as v4.
> > 
> > v6,
> > 1. Pin vfsmnt to mount point at first, when reference increace (==2),
> >    grab a reference to vfsmnt by mntget. When decreace (==1),
> >    drop the reference to vfsmnt, left pin.
> > 2. Delete cache_head directly from cache_detail.
> > 
> > v7, 
> > implement self reference increase and decrease for nfsd exports/expkey 
> > 
> > When reference of cahce_head increase(>1), grab a reference of mnt once.
> > and reference decrease to 1 (==1), drop the reference of mnt.
> > 
> > So after that,
> > When ref > 1, user cannot umount the filesystem with -EBUSY.
> > when ref ==1, means cache only reference by nfsd cache,
> > no other reference. So user can try umount,
> > 1. before set MNT_UMOUNT (protected by mount_lock), nfsd cache is
> >    referenced (ref > 1, legitimize_mntget), umount will fail with -EBUSY.
> > 2. after set MNT_UMOUNT, nfsd cache is referenced (ref == 2),
> >    legitimize_mntget will fail, and set cache to CACHE_NEGATIVE,
> >    and the reference will be dropped, re-back to 1.
> >    So, pin_kill can delete the cache and umount success.
> > 3. when umountting, no reference to nfsd cache,
> >    pin_kill can delete the cache and umount success.
> > 
> > Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
> 
> Wow.... this is turning out to be a lot more complex that I imagined at
> first (isn't that always the way!).
> 
> There is a lot of good stuff here, but I think we can probably make it
> simpler and so even better.

I'm still not convinced that the expkey should have a dentry reference
in the key in the first place.  Fixing that would fix the immediate
problem.

(Though it might still be useful to have a way to do stuff on umount of
an exported filesystem.)

--b.

> 
> I particularly don't like the get_ref/put_ref pointers in cache_head.
> They make cache_head a lot bigger than it was before, and they are only
> needed for two specific caches.  And then they are the same for every element
> in the cache.
> 
> I also don't like the ref_mutex ... or I don't like where it is used...
> or something.  I definitely don't think we need one per cached entry.
> Maybe one per cache.
> 
> I can certainly see that the "first" time we get a reference to a cache
> item that holds a vfsmnt pointer, we need to "legitimize" that - or
> fail.  But I don't think that has to happen inside the cache.c
> machinery.
> 
> How about this:
>  - add a new cache flag "CACHE_ACTIVE" (for example) which the cache
>    owner can set whenever it likes.  When cache_put finds that CACHE_ACTIVE
>    is set when refcount is <= 2, it calls a new cache_detail method: cache_deactivate.
>  - cache_deactivate takes a mutex (yes, we do need one, don't we)
>    and if CACHE_ACTIVE is still set and refcount is still <= 2,
>    it drops the reference on the vfsmnt and clears CACHE_ACTIVE.
>    This actually needs to be something like:
>     if (test_and_clear_bit(CACHE_ACTIVE,...)) {
>         if (atomic_read(..refcnt) > 2) {
>              set_bit(CACHE_ACTIVE);
>              mutex_unlock()
>              return
> 
>    so that if other code gets a reference and tests CACHE_ACTIVE, it
>    won't suddenly become inactive.  Might need a memory barrier in there...
>    no, test_and_clear implies a memory barrier.
> 
> We only need to make changes to svc_export and svc_expkey - right?
> So that would be:
>  Change svc_export_lookup and svc_expkey_lookup so they look something
>  like:
> 
>   svc_XX_lookup(struct cache_detail *cd, struct svc_XXX *item)
>   {
>       struct cache_head *ch;
>       int hash = svc_XXX_hash(item);
> 
>       ch = sunrpc_cache_lookup(cd, &item->h, hash);
>       if (!ch)
>            return NULL;
>       item = container_of(ch, struct svc_XXX, h);
>       if (!test_bit(CACHE_VALID, &ch->flags) ||
>           test_bit(CACHE_NEGATIVE, &ch->flags) ||
>           test_bit(CACHE_ACTIVE, &ch->flags))
>             return item;
> 
>       mutex_lock(&svc_XXX_mutex);
>       if (!test_bit(CACHE_ACTIVE, &ch->flags)) {
>               if (legitimize_mnt_get() == NULL) {
>                       XXX_put(item);
>                       item = NULL;
>               } else
>                       set_bit(CACHE_ACTIVE, &ch->flags);
>       }
>       mutex_unlock(&something);
>       return item;
>  }
> 
> Then the new 'cache_deactivate' function is something like:
> 
>   svc_XXX_deactivate(struct cache_detail *cd, struct cache_head *ch)
>   {
>        struct svc_XXX *item = container_of(ch, &item->h, item);
> 
>        mutex_lock(&svc_XXX_mutex);
>        if (test_and_clear_bit(CACHE_ACTIVE, &ch->flags)) {
>               if (atomic_read(&ch->ref.refcount) > 2) {
>                    /* Race with get_ref - do nothing */
>                    set_bit(CACHE_ACTIVE, &ch->flags);
>               else
>                    mntput(....mnt);
>        }
>        mutex_unlock(&svc_XXX_mutex);
>   }
> 
> 
> cache_put would have:
> 
>     if (test_bit(CACHE_ACTIVE, &h->flags) &&
>         cd->cache_deactivate &&
>         atomic_read(&h->ref.refcount <= 2))
>            cd->cache_deactivate(cd, h);
> 
> but there is still a race.  If: (T1 and T2 are threads)
>    T1: cache_put finds refcount is 2 and CACHE_ACTIVE is set and calls ->cache_deactiveate
>    T2: cache_get increments the refcount to 3
>    T1: cache_deactivate clears CACHE_ACTIVE and find refcount is 3
>    T2: now calls cache_put, which sees CACHE_ACTIVE is clear so refcount becomes 2
>    T1: sets CACHE_ACTIVE again and continues.  refcount becomes 1.
> 
> So not refcount is 1 and the item is still active.
> 
> We can fix this by making cache_put loop:
>     while (test_bit(CACHE_ACTIVE, &h->flags) &&
>           cd->cache_deactivate &&
>           (smb_rmb(), 1) &&
>           atomic_read(&h->ref.refcount <= 2))
>            cd->cache_deactivate(cd, h);
> 
> This should ensure that refcount never gets to 1 with the
> item still active (i.e. with a ref count on the mnt).
> 
> 
> The work item and completion are a bit unfortunate too.
> 
> I guess the problem here is that pin_kill() can run while there are
> some inactive references to the cache item.  There can then be a race
> over who will use path_put_unpin to put the dentry.
> 
> Could we fix that by having expXXX_pin_kill() use kref_get_unless_zero()
> on the cache item.
> If that succeeds, then path_put_unpin hasn't been called and it won't be.
> So expXXX_pin_kill can call it and then set CACHE_NEGATIVE.
> If it fails, then it has already been called and nothing else need be done.
> Almost.
> If kref_get_unless_zero() fails, pin_remove() may not have been called
> yet, but it will be soon.  We might need to wait.
> It would be nice if pin_kill() would check ->done again after calling p->kill.
> e.g.
> 
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b5408f6ec..c2ef5c9d4c0d 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -47,7 +47,9 @@ void pin_kill(struct fs_pin *p)
>  		spin_unlock_irq(&p->wait.lock);
>  		rcu_read_unlock();
>  		p->kill(p);
> -		return;
> +		if (p->done > 0)
> +			return;
> +		spin_lock_irq(&p->wait.lock);
>  	}
>  	if (p->done > 0) {
>  		spin_unlock_irq(&p->wait.lock);
> 
> I think that would close the last gap, without needing extra work
> items and completion in the nfsd code.
> 
> Al: would you be OK with that change to pin_kill?
> 
> Thanks,
> NeilBrown
NeilBrown July 15, 2015, 11:40 p.m. UTC | #15
On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
<bfields@fieldses.org> wrote:

> > Wow.... this is turning out to be a lot more complex that I imagined at
> > first (isn't that always the way!).
> > 
> > There is a lot of good stuff here, but I think we can probably make it
> > simpler and so even better.
> 
> I'm still not convinced that the expkey should have a dentry reference
> in the key in the first place.  Fixing that would fix the immediate
> problem.

???  If we removed the dentry, how would you export a subdirectory of a
filesystem?

NeilBrown


> 
> (Though it might still be useful to have a way to do stuff on umount of
> an exported filesystem.)
> 
> --b.
> 
J. Bruce Fields July 16, 2015, 8:51 p.m. UTC | #16
On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> <bfields@fieldses.org> wrote:
> 
> > > Wow.... this is turning out to be a lot more complex that I imagined at
> > > first (isn't that always the way!).
> > > 
> > > There is a lot of good stuff here, but I think we can probably make it
> > > simpler and so even better.
> > 
> > I'm still not convinced that the expkey

(Sorry, I meant an entry in the export cache, not the expkey cache.)

> should have a dentry reference
> > in the key in the first place.  Fixing that would fix the immediate
> > problem.
> 
> ???  If we removed the dentry, how would you export a subdirectory of a
> filesystem?

I've been wondering if the export cache should really be keyed on the
string representation of the path instead of the struct path.  That's
what the userspace interface uses.

There's a related bug: if there are mountpoints at both /a and /a/b,
then thanks to the lookup-underneath-mountpoint behavior of the server,
an NFSv3 client looking that up will end up going underneath the first
mountpoint and doing an export cache lookup for

	(vfsmnt, dentry) == (/, /a/b)

When the server gets a response that starts with "/a/b", it interprets
that as applying to the path (/a, /a/b), so doesn't recognize it as
resolving the query about (/, /a/b).

Well, at least I assume that's why I see "ls" hang if I run "ls
/mnt/a/b" on the client.  And there may be some better fix, but I always
figured the root (hah) problem here was due to indexing the cache on
struct path while the upcall interface uses the full path string.

--b.
NeilBrown July 21, 2015, 9:58 p.m. UTC | #17
On Thu, 16 Jul 2015 16:51:48 -0400 "J. Bruce Fields"
<bfields@fieldses.org> wrote:

> On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> > On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> > <bfields@fieldses.org> wrote:
> > 
> > > > Wow.... this is turning out to be a lot more complex that I imagined at
> > > > first (isn't that always the way!).
> > > > 
> > > > There is a lot of good stuff here, but I think we can probably make it
> > > > simpler and so even better.
> > > 
> > > I'm still not convinced that the expkey
> 
> (Sorry, I meant an entry in the export cache, not the expkey cache.)

They are very closely related.  An incoming filehandle has its 'fs'
identifier mapped through the expkey cache to get an "export point".
Then the "export point" plus "client identifier" are mapped through the
export cache to get export options.

So the "export point" thing in the expkey cache should really be the
same as the thing in the export cache.

> 
> > should have a dentry reference
> > > in the key in the first place.  Fixing that would fix the immediate
> > > problem.
> > 
> > ???  If we removed the dentry, how would you export a subdirectory of a
> > filesystem?
> 
> I've been wondering if the export cache should really be keyed on the
> string representation of the path instead of the struct path.  That's
> what the userspace interface uses.

That makes sense for handling updates to the cache from user-space.
I'm not sure it is so good for handling mapping from file-handle to
export flags.

> 
> There's a related bug: if there are mountpoints at both /a and /a/b,

Just to make sure I'm clear on what you are saying, this would be
achieved by, e.g.

 mkdir -p /a/b
 mount /dev/sdX /a/b
 mount /dev/sdY /a

so the mount of /dev/sdX is completely unreachable from '/'.

Correct?

> then thanks to the lookup-underneath-mountpoint behavior of the server,
> an NFSv3 client looking that up will end up going underneath the first
> mountpoint and doing an export cache lookup for
> 
> 	(vfsmnt, dentry) == (/, /a/b)

Maybe this step is where the bug is. "/a/b" is not really a valid name.
Should d_path() check for paths that have been mounted over, and attach
"(unreachable)" to the end of the path, similar to "(deleted)".

sys_getcwd() can give you "(unreachable)" when in a filesystem that has
e.g. been lazy-unmounted.  Maybe we want something similar for a
mounted-over filesystem???


> 
> When the server gets a response that starts with "/a/b", it interprets
> that as applying to the path (/a, /a/b), so doesn't recognize it as
> resolving the query about (/, /a/b).
> 
> Well, at least I assume that's why I see "ls" hang if I run "ls
> /mnt/a/b" on the client.  And there may be some better fix, but I always
> figured the root (hah) problem here was due to indexing the cache on
> struct path while the upcall interface uses the full path string.
> 
Sounds like a very odd corner case - how did you stumble on to it?

Thanks,
NeilBrown
J. Bruce Fields July 22, 2015, 3:08 p.m. UTC | #18
On Wed, Jul 22, 2015 at 07:58:24AM +1000, NeilBrown wrote:
> On Thu, 16 Jul 2015 16:51:48 -0400 "J. Bruce Fields"
> <bfields@fieldses.org> wrote:
> 
> > On Thu, Jul 16, 2015 at 09:40:46AM +1000, NeilBrown wrote:
> > > On Wed, 15 Jul 2015 17:07:56 -0400 "J. Bruce Fields"
> > > <bfields@fieldses.org> wrote:
> > > 
> > > > > Wow.... this is turning out to be a lot more complex that I imagined at
> > > > > first (isn't that always the way!).
> > > > > 
> > > > > There is a lot of good stuff here, but I think we can probably make it
> > > > > simpler and so even better.
> > > > 
> > > > I'm still not convinced that the expkey
> > 
> > (Sorry, I meant an entry in the export cache, not the expkey cache.)
> 
> They are very closely related.  An incoming filehandle has its 'fs'
> identifier mapped through the expkey cache to get an "export point".
> Then the "export point" plus "client identifier" are mapped through the
> export cache to get export options.
> 
> So the "export point" thing in the expkey cache should really be the
> same as the thing in the export cache.
> 
> > 
> > > should have a dentry reference
> > > > in the key in the first place.  Fixing that would fix the immediate
> > > > problem.
> > > 
> > > ???  If we removed the dentry, how would you export a subdirectory of a
> > > filesystem?
> > 
> > I've been wondering if the export cache should really be keyed on the
> > string representation of the path instead of the struct path.  That's
> > what the userspace interface uses.
> 
> That makes sense for handling updates to the cache from user-space.
> I'm not sure it is so good for handling mapping from file-handle to
> export flags.
> 
> > 
> > There's a related bug: if there are mountpoints at both /a and /a/b,
> 
> Just to make sure I'm clear on what you are saying, this would be
> achieved by, e.g.
> 
>  mkdir -p /a/b
>  mount /dev/sdX /a/b
>  mount /dev/sdY /a
> 
> so the mount of /dev/sdX is complete unreachable from '/'.
> 
> Correct?

Actually my reproducer used something like:

	server# mount --bind /a
	server# mount --bind /a/b

then a v3 mount of / and "ls /mnt/a/b" from the client.

> > then thanks to the lookup-underneath-mountpoint behavior of the server,
> > an NFSv3 client looking that up will end up going underneath the first
> > mountpoint and doing an export cache lookup for
> > 
> > 	(vfsmnt, dentry) == (/, /a/b)
> 
> Maybe this step is where the bug is. "/a/b" is not really a valid name.
> Should d_path() check for paths that have been mounted over, and attach
> "(unreachable)" to the end of the path, similar to "(deleted)".
> 
> sys_getcwd() can give you "(unreachable)" when in a filesystem that has
> e.g. been lazy-unmounted.  Maybe we want something similar for a
> mounted-over filesystem???
> 
> 
> > 
> > When the server gets a response that starts with "/a/b", it interprets
> > that as applying to the path (/a, /a/b), so doesn't recognize it as
> > resolving the query about (/, /a/b).
> > 
> > Well, at least I assume that's why I see "ls" hang if I run "ls
> > /mnt/a/b" on the client.  And there may be some better fix, but I always
> > figured the root (hah) problem here was due to indexing the cache on
> > struct path while the upcall interface uses the full path string.
> > 
> Sounds like a very odd corner case - how did you stumble on to it?

I dug through my old email, but that may be lost in the mists of
time....

My memory is that I ran across a similar hang while testing some mountd
changes, but couldn't reproduce it.  (It might have involved a change to
the exports?)  So came up with this case by inspection.

I've had this nagging todo to work out if there are other interesting
consequences of the fact that the cache is internally keyed on one thing
and appears to mountd to be keyed on another.  (And that there's a
complicated many<->many relationship between those two things.)  But I
haven't gotten to it.  Could be all unlikely corner cases, for all I
know.

--b.
NeilBrown July 24, 2015, 2:05 a.m. UTC | #19
On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
wrote:

> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> 
> > Actually, with that change to pin_kill, this side of things becomes
> > really easy.
> > All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> > If that doesn't cause the entry to be put, then something else has a
> > temporary reference which will be put soon.  In any case, pin_kill()
> > will wait long enough, but not indefinitely.
> > No need for kref_get_unless_zero() or any of that.
> 
> No.  You are seriously misunderstanding what ->kill() is for and what the
> existing instances are doing.  Again, there is no promise whatsoever that
> the object containing fs_pin instance will *survive* past ->kill().
> At all.
> 
> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> description of lifetime rules and ordering (who waits for whom and how long).
> For all the objects involved.

Good point.  Let me try.

Entries in the sunrpc 'cache' each contain some 'key' fields and some
'content' fields.

The key fields are set by the .init() method when the entry is
created, which can happen in a call to sunrpc_cache_lookup() or to
sunrpc_cache_update().

The content fields are set by the .update() method when a value is
provided for the cache entry.  This happens in sunrpc_cache_update();

A cache entry can be not-valid, negative, or valid.
It starts not-valid when sunrpc_cache_lookup() fails to find the search
key and so creates a new entry (and sets up the key with .init).
It then transitions to either negative or valid.
This can happen through sunrpc_cache_update() or through an error when
instigating an up-call, in which case it goes to negative.
Once it is negative or valid, it stays that way until it is released.
If sunrpc_cache_update is called on an entry that is not not-valid,
then a new entry is created and the old one is marked as expired.
A cache search will find the new one before the old.

The vfsmount object is involved in two separate caches.
It is part of the content of svc_expkey and part of the key of
svc_export.

An svc_expkey entry is only ever held transiently.  It is held while an
update is being processed, and it is held briefly while mapping a
filehandle to a mnt+dentry.
First, part of the filehandle is used to access the svc_expkey cache
to get the vfsmnt.  Then that vfsmnt plus the client identification is
looked up in the svc_export cache to find the export options.  Then the
svc_expkey cache entry is released.

So it is only held during a lookup of another cache.  This can take an
arbitrarily long time as the lookup can go to rpc.mountd in user-space.


The svc_export cache entry can be held for the duration of a single NFS
request.  It is stored in the 'struct svc_fh' file handle structure
which is released at the end of handling the request.

The vfsmnt and dentry are only "used" to validate the filehandle and
then while that filehandle is still active.


To avoid having unmount hang while nfsd is performing an upcall to
mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
fails, exp_find_key() can fail and we would never perform the lookup on
svc_export.

If it succeeds, then the legitimacy can be handed over to the svc_export
cache entry, which could then continue to own it, or could hand it on
to the svc_fh.

The latter is *probably* cleanest.
i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
fh_put must put it.

exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
return from exp_find implies an active reference to ->ex_path.mnt.
If exp_find fails, it needs to mnt_put(ek->ek_path.mnt).
All callers of exp_find need to mnt_put(exp->ex_path.mnt) when they
decide not to use the exp, and must otherwise store it in an svc_fh.
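
So exp_find_key() would gain something like this (a sketch; legitimize_mntget()
is the helper from this series and the existing lookup code is elided):

   static struct svc_expkey *exp_find_key(struct cache_detail *cd, ...)
   {
        struct svc_expkey *ek;
        ...
        /* hand an active mnt reference to the caller, or fail */
        if (legitimize_mntget(ek->ek_path.mnt) == NULL) {
                cache_put(&ek->h, cd);
                return ERR_PTR(-ENOENT);
        }
        return ek;
   }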

With this, pin_kill() should only need to wait for  exp_find_key() to
discover that it cannot legitimize the mount, or for expkey_parse() to
replace the key via sunrpc_cache_update(), or maybe for cache_clean()
to discard an old entry.

Hopefully that makes it all clear.

Thanks,
NeilBrown
Kinglong Mee July 27, 2015, 2:28 a.m. UTC | #20
On 7/24/2015 10:05, NeilBrown wrote:
> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
> wrote:
> 
>> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
>>
>>> Actually, with that change to pin_kill, this side of things becomes
>>> really easy.
>>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
>>> If that doesn't cause the entry to be put, then something else has a
>>> temporary reference which will be put soon.  In any case, pin_kill()
>>> will wait long enough, but not indefinitely.
>>> No need for kref_get_unless_zero() or any of that.
>>
>> No.  You are seriously misunderstanding what ->kill() is for and what the
>> existing instances are doing.  Again, there is no promise whatsoever that
>> the object containing fs_pin instance will *survive* past ->kill().
>> At all.
>>
>> RTFS, please.  What is sorely missing in this recurring patchset is a clear
>> description of lifetime rules and ordering (who waits for whom and how long).
>> For all the objects involved.
> 
> Good point.  Let me try.
> 
> Entries in the sunrpc 'cache' each contain some 'key' fields and some
> 'content' fields.
> 
> The key fields are set by the .init() method when the entry is
> created, which can happen in a call to sunrpc_cache_lookup() or to
> sunrpc_cache_update().
> 
> The content fields are set by the .update() method when a value is
> provided for the cache entry.  This happens in sunrpc_cache_update();
> 
> A cache entry can be not-valid, negative, or valid.
> It starts non-valid when sunrpc_cache_lookup() fails to find the search
> key and so creates a new entry (and sets up the key with .init).
> It then transitions to either negative or valid.
> This can happen through sunrpc_cache_update() or through an error when
> instigating an up-call, in which case it goes to negative.
> Once it is negative or valid, it stays that way until it is released.
> If sunrpc_cache_update is called on an entry that is not not-valid,
> then a new entry is created and the old one is marked as expired.
> A cache search will find the new one before the old.
> 
> The vfsmount object is involved in two separate caches.
> It is part of the content of svc_expkey and part of the key of
> svc_export.
> 
> An svc_expkey entry is only ever held transiently.  It is held while an
> update is being processed, and it is held briefly while mapping a
> filehandle to a mnt+dentry.
> Firstly part of the filehandle is used to acccess the svc_expkey cache
> to get the vfsmnt.  Then that vfsmnt plus the client identification is
> looked up in the svc_export cache to find the export options.  Then the
> svc_expkey cache entry is released.
> 
> So it is only held during a lookup of another cache.  This can take an
> arbitrarily long time as the lookup can go to rpc.mountd in user-space.
> 
> 
> The svc_export cache entry can be held for the duration of a single NFS
> request.  It is stored in the 'struct svc_fh' file handle structure
> which is release at the end of handling the request.
> 
> The vfsmnt and dentry are only "used" to validate the filehandle and
> then while that filehandle is still active.
> 
> 
> To avoid having unmount hang while nfsd is performing an upcall to
> mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
> fails, exp_find_key() can fail and we would never perform the lookup on
> svc_export.
> 
> If it succeeds, then the legitimacy can be handed over to the svc_export
> cache entry, which could then continue to own it, or could hand it on
> to the svc_fh.
> 
> The latter is *probably* cleanest.
> i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
> fh_put must put it.

I don't agree with adding a new field (e.g. fh_vfsmnt) to svc_fh.

With it, should nfsd always use fh_vfsmnt, and never use exp->ex_path.mnt
outside of export.c/export.h?

If we choose fh_vfsmnt, a lot of code needs to be updated, especially
function signatures.  If we keep using exp->ex_path.mnt, the new field
fh_vfsmnt seems redundant.

Thanks for your work.

It suggests another approach:

1. There is only one outlet from each cache: exp_find_key() for expkey,
   and exp_get_by_name() for export.
2. Any fsid-to-export or filehandle-to-export mapping goes through those functions.
3. exp_get()/exp_put() increase/decrease the reference count of an export.

Similar to the fh_vfsmnt idea (but not the same), call legitimize_mntget()
in the only outlet functions, exp_find_key()/exp_get_by_name(); if that
fails, return STALE.  Otherwise any valid expkey/export returned from the
cache has already been legitimized (it holds a reference to the vfsmnt).

Add mntget() in exp_get() and mntput() in exp_put(), because every export
passed to exp_get()/exp_put() was returned from exp_find_key()/exp_get_by_name().
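
Something like this (sketch only, layered on the existing exp_get()/exp_put()
helpers in fs/nfsd/export.h; not the final patch):

static inline void exp_get(struct svc_export *exp)
{
	mntget(exp->ex_path.mnt);	/* every live exp already holds a legitimate mnt ref */
	cache_get(&exp->h);
}

static inline void exp_put(struct svc_export *exp)
{
	struct vfsmount *mnt = exp->ex_path.mnt;

	/* take a copy first: cache_put() may free exp on the last reference */
	cache_put(&exp->h, exp->cd);
	mntput(mnt);
}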

> 
> exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
> return from exp_find implies an active refernece to ->ex_path.mnt.
> If exp_find fails, it needs to mnt_put(ek->ek_path.mnt).

Yes, it's great.

> All callers of exp_find need to mnt_put(exp->ex_path.mnt) when they
> decide not to use the exp, and must otherwise store it in an svc_fh.
> 
> With this, pin_kill() should only need to wait for  exp_find_key() to
> discover that it cannot legitimize the mount, or for expkey_path() to
> replace the key via sunrpc_cache_update(), or maybe for cache_clean()
> to discard an old entry.
> 
> Hopefully that makes it all clear.

Yes, thanks again.

With my method, for the expkey cache (a rough sketch follows after these lists):
1. A fsid is passed to exp_find_key(), which looks up an entry via
   svc_expkey_lookup(); on success, ek->ek_path is pinned to the mount.
2. legitimize_mntget() is then called to take a reference to the vfsmnt
   before returning from exp_find_key().
3. Any caller that gets a valid entry from exp_find_key() must put the vfsmnt.

For the export cache,
1. A path (returned from exp_find_key()) with a legitimized vfsmnt is
   passed to exp_get_by_name(); on success, exp->ex_path is pinned to the mount.
2. legitimize_mntget() is then called to take a reference to the vfsmnt
   before returning from exp_get_by_name().
3. Any caller that gets a valid entry from exp_get_by_name() must put the
   vfsmnt via exp_put().
4. Anything that keeps using the exp returned from exp_get_by_name() must
   call exp_get(), which also takes a reference to the vfsmnt.

So,
a. After the reference is taken in step 2, any umount of the filesystem gets -EBUSY.
b. After all the references from step 4 are put, or before the reference is
   taken in step 2, an umount of the filesystem calls pin_kill(), which
   deletes the cache entry directly and also unpins the vfsmount.
c. Between steps 1 and 2 we hold a reference to the exp/key cache entry but
   not yet a legitimate vfsmnt.  As you said, an umount of the filesystem
   then only has to wait for exp_find_key()/exp_get_by_name() to put the
   cache reference once legitimize_mntget() fails.
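
Roughly, the expkey outlet could look like this (simplified from the real
exp_find_key() in fs/nfsd/export.c; cache_check()/upcall handling is elided,
and legitimize_mntget() is the helper introduced earlier in this series):

static struct svc_expkey *exp_find_key(struct cache_detail *cd,
				       struct auth_domain *clp, int fsid_type,
				       u32 *fsidv, struct cache_req *reqp)
{
	struct svc_expkey key, *ek;

	key.ek_client = clp;
	key.ek_fsidtype = fsid_type;
	memcpy(key.ek_fsid, fsidv, key_len(fsid_type));

	ek = svc_expkey_lookup(cd, &key);		/* step 1: ek_path pinned to the mount */
	if (!ek)
		return ERR_PTR(-ENOMEM);
	/* cache_check()/upcall handling elided for brevity */

	if (!legitimize_mntget(ek->ek_path.mnt)) {	/* step 2 failed: an unmount is in progress */
		cache_put(&ek->h, cd);
		return ERR_PTR(-ESTALE);		/* caller reports a stale filehandle */
	}
	return ek;	/* step 3: every successful caller must mntput(ek->ek_path.mnt) later */
}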

A new version of this patch set will be pushed later.

thanks,
Kinglong Mee
NeilBrown July 27, 2015, 2:51 a.m. UTC | #21
On Mon, 27 Jul 2015 10:28:52 +0800 Kinglong Mee <kinglongmee@gmail.com>
wrote:

> On 7/24/2015 10:05, NeilBrown wrote:
> > On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
> > wrote:
> > 
> >> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
> >>
> >>> Actually, with that change to pin_kill, this side of things becomes
> >>> really easy.
> >>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
> >>> If that doesn't cause the entry to be put, then something else has a
> >>> temporary reference which will be put soon.  In any case, pin_kill()
> >>> will wait long enough, but not indefinitely.
> >>> No need for kref_get_unless_zero() or any of that.
> >>
> >> No.  You are seriously misunderstanding what ->kill() is for and what the
> >> existing instances are doing.  Again, there is no promise whatsoever that
> >> the object containing fs_pin instance will *survive* past ->kill().
> >> At all.
> >>
> >> RTFS, please.  What is sorely missing in this recurring patchset is a clear
> >> description of lifetime rules and ordering (who waits for whom and how long).
> >> For all the objects involved.
> > 
> > Good point.  Let me try.
> > 
> > Entries in the sunrpc 'cache' each contain some 'key' fields and some
> > 'content' fields.
> > 
> > The key fields are set by the .init() method when the entry is
> > created, which can happen in a call to sunrpc_cache_lookup() or to
> > sunrpc_cache_update().
> > 
> > The content fields are set by the .update() method when a value is
> > provided for the cache entry.  This happens in sunrpc_cache_update();
> > 
> > A cache entry can be not-valid, negative, or valid.
> > It starts non-valid when sunrpc_cache_lookup() fails to find the search
> > key and so creates a new entry (and sets up the key with .init).
> > It then transitions to either negative or valid.
> > This can happen through sunrpc_cache_update() or through an error when
> > instigating an up-call, in which case it goes to negative.
> > Once it is negative or valid, it stays that way until it is released.
> > If sunrpc_cache_update is called on an entry that is not not-valid,
> > then a new entry is created and the old one is marked as expired.
> > A cache search will find the new one before the old.
> > 
> > The vfsmount object is involved in two separate caches.
> > It is part of the content of svc_expkey and part of the key of
> > svc_export.
> > 
> > An svc_expkey entry is only ever held transiently.  It is held while an
> > update is being processed, and it is held briefly while mapping a
> > filehandle to a mnt+dentry.
> > Firstly part of the filehandle is used to acccess the svc_expkey cache
> > to get the vfsmnt.  Then that vfsmnt plus the client identification is
> > looked up in the svc_export cache to find the export options.  Then the
> > svc_expkey cache entry is released.
> > 
> > So it is only held during a lookup of another cache.  This can take an
> > arbitrarily long time as the lookup can go to rpc.mountd in user-space.
> > 
> > 
> > The svc_export cache entry can be held for the duration of a single NFS
> > request.  It is stored in the 'struct svc_fh' file handle structure
> > which is release at the end of handling the request.
> > 
> > The vfsmnt and dentry are only "used" to validate the filehandle and
> > then while that filehandle is still active.
> > 
> > 
> > To avoid having unmount hang while nfsd is performing an upcall to
> > mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
> > fails, exp_find_key() can fail and we would never perform the lookup on
> > svc_export.
> > 
> > If it succeeds, then the legitimacy can be handed over to the svc_export
> > cache entry, which could then continue to own it, or could hand it on
> > to the svc_fh.
> > 
> > The latter is *probably* cleanest.
> > i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
> > fh_put must put it.
> 
> I don't agree adding new argument (eg, fh_vfsmnt) in svc_fh.

I wasn't suggesting that a new field be added to svc_fh.
Just that if svc_fh->fh_export was not NULL, then the svc_fh "owned" a
reference to svc_fh->fh_export->ex_path.mnt which it had to mntput()
when it released ->fh_export.

So fh_put would need to change, but not much else.
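
For example (heavily simplified from the real fh_put() in fs/nfsd/nfsfh.c,
which also handles wcc data, locking and write access):

void fh_put(struct svc_fh *fhp)
{
	struct dentry *dentry = fhp->fh_dentry;
	struct svc_export *exp = fhp->fh_export;

	if (dentry) {
		fhp->fh_dentry = NULL;
		dput(dentry);
	}
	if (exp) {
		mntput(exp->ex_path.mnt);	/* the reference the svc_fh has owned since fh_compose/fh_verify */
		exp_put(exp);
		fhp->fh_export = NULL;
	}
}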

It isn't the only way to handle those references - it just seemed the
neatest as I was writing the description.  Something else might work
better in the code.

> 
> With it, should nfsd using fh_vfsmnt always, never using exp->ex_path.mnt
> outside of export.c/export.h ?
> 
> If choose fh_vfsmnt, so many codes need be updated, especially functions.
> If exp->ex_path.mnt, the new argument fh_vfsmnt seems redundant.
> 
> Thanks for your work.
> 
> It reminders a new method,
> 
> 1. There are only one outlet from each cache, exp_find_key() for expkey, 
>    exp_get_by_name() for export.
> 2. Any fsid to export or filehandle to export will call the function.
> 3. exp_get()/exp_put() increase/decrease the reference of export.
> 
> Like the fh_vfsmnt (not same), call legitimize_mntget() in the only
> outlet function exp_find_key()/exp_get_by_name(), if fail return STALE,
> otherwise, any valid expkey/export from the cache is validated (Have
> get the reference of vfsmnt).
> 
> Add mntget() in exp_get() and mntput() in exp_put(), because the export
> passed to exp_get/exp_put are returned from exp_find_key/exp_get_by_name.
> 
> > 
> > exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
> > return from exp_find implies an active refernece to ->ex_path.mnt.
> > If exp_find fails, it needs to mnt_put(ek->ek_path.mnt).
> 
> Yes, it's great.
> 
> > All callers of exp_find need to mnt_put(exp->ex_path.mnt) when they
> > decide not to use the exp, and must otherwise store it in an svc_fh.
> > 
> > With this, pin_kill() should only need to wait for  exp_find_key() to
> > discover that it cannot legitimize the mount, or for expkey_path() to
> > replace the key via sunrpc_cache_update(), or maybe for cache_clean()
> > to discard an old entry.
> > 
> > Hopefully that makes it all clear.
> 
> Yes, thanks again.
> 
> With my method, for expkey cache,
> 1. At first, a fsid is passed to exp_find_key, and lookup a cache
>    in svc_expkey_lookup, if success, ekey->ek_path is pined to mount.
> 2. Then call legitimize_mntget getting a reference of vfsmnt 
>    before return from exp_find_key.
> 3. Any calling exp_find_key with valid cache must put the vfsmnt.
> 
> for export cache,
> 1. At first, a path (returned from exp_find_key) with validate vfsmnt
>    is passed to exp_get_by_name, if success, exp->ex_path is pined to mount.
> 2. Then call legitimize_mntget getting a reference of vfsmnt 
>    before return from exp_get_by_name.

I don't see any point in calling legitimize_mntget here.  exp_find_key
already did the 'legitimize' bit so there is no need to do it again.

> 3. Any calling exp_get_by_name with valid cache must put the vfsmnt
>    by exp_put();
> 4. Any using the exp returned from exp_get_by_name must call exp_get(),
>    will increase the reference of vfsmnt.
> 
> So,
> a. After getting the reference in 2, any umount of filesystem will get -EBUSY.
> b. After put all reference after 4, or before get the reference in 2, 
>    any umount of filesystem will call pin_kill, and delete the cache directly,
>    also unpin the vfsmount.
> c. Between 1 and 2, have get the reference of exp/key cache, with invalidate vfsmnt.
>    As you said, umount of filesystem only wait exp_find_key/exp_get_by_name
>    put the reference of cache when legitimize_mntget fail.
> 
> A new update of this patch site will be push later.

I look forward to it.  Thanks,

NeilBrown

Kinglong Mee July 27, 2015, 3:17 a.m. UTC | #22
On 7/27/2015 10:51, NeilBrown wrote:
> On Mon, 27 Jul 2015 10:28:52 +0800 Kinglong Mee <kinglongmee@gmail.com>
> wrote:
> 
>> On 7/24/2015 10:05, NeilBrown wrote:
>>> On Mon, 13 Jul 2015 05:45:53 +0100 Al Viro <viro@ZenIV.linux.org.uk>
>>> wrote:
>>>
>>>> On Mon, Jul 13, 2015 at 02:20:59PM +1000, NeilBrown wrote:
>>>>
>>>>> Actually, with that change to pin_kill, this side of things becomes
>>>>> really easy.
>>>>> All expXXX_pin_kill needs to do is call your new cache_delete_entry.
>>>>> If that doesn't cause the entry to be put, then something else has a
>>>>> temporary reference which will be put soon.  In any case, pin_kill()
>>>>> will wait long enough, but not indefinitely.
>>>>> No need for kref_get_unless_zero() or any of that.
>>>>
>>>> No.  You are seriously misunderstanding what ->kill() is for and what the
>>>> existing instances are doing.  Again, there is no promise whatsoever that
>>>> the object containing fs_pin instance will *survive* past ->kill().
>>>> At all.
>>>>
>>>> RTFS, please.  What is sorely missing in this recurring patchset is a clear
>>>> description of lifetime rules and ordering (who waits for whom and how long).
>>>> For all the objects involved.
>>>
>>> Good point.  Let me try.
>>>
>>> Entries in the sunrpc 'cache' each contain some 'key' fields and some
>>> 'content' fields.
>>>
>>> The key fields are set by the .init() method when the entry is
>>> created, which can happen in a call to sunrpc_cache_lookup() or to
>>> sunrpc_cache_update().
>>>
>>> The content fields are set by the .update() method when a value is
>>> provided for the cache entry.  This happens in sunrpc_cache_update();
>>>
>>> A cache entry can be not-valid, negative, or valid.
>>> It starts non-valid when sunrpc_cache_lookup() fails to find the search
>>> key and so creates a new entry (and sets up the key with .init).
>>> It then transitions to either negative or valid.
>>> This can happen through sunrpc_cache_update() or through an error when
>>> instigating an up-call, in which case it goes to negative.
>>> Once it is negative or valid, it stays that way until it is released.
>>> If sunrpc_cache_update is called on an entry that is not not-valid,
>>> then a new entry is created and the old one is marked as expired.
>>> A cache search will find the new one before the old.
>>>
>>> The vfsmount object is involved in two separate caches.
>>> It is part of the content of svc_expkey and part of the key of
>>> svc_export.
>>>
>>> An svc_expkey entry is only ever held transiently.  It is held while an
>>> update is being processed, and it is held briefly while mapping a
>>> filehandle to a mnt+dentry.
>>> Firstly part of the filehandle is used to acccess the svc_expkey cache
>>> to get the vfsmnt.  Then that vfsmnt plus the client identification is
>>> looked up in the svc_export cache to find the export options.  Then the
>>> svc_expkey cache entry is released.
>>>
>>> So it is only held during a lookup of another cache.  This can take an
>>> arbitrarily long time as the lookup can go to rpc.mountd in user-space.
>>>
>>>
>>> The svc_export cache entry can be held for the duration of a single NFS
>>> request.  It is stored in the 'struct svc_fh' file handle structure
>>> which is release at the end of handling the request.
>>>
>>> The vfsmnt and dentry are only "used" to validate the filehandle and
>>> then while that filehandle is still active.
>>>
>>>
>>> To avoid having unmount hang while nfsd is performing an upcall to
>>> mountd, we need to legitimize the vfsmnt in the svc_expkey.  If that
>>> fails, exp_find_key() can fail and we would never perform the lookup on
>>> svc_export.
>>>
>>> If it succeeds, then the legitimacy can be handed over to the svc_export
>>> cache entry, which could then continue to own it, or could hand it on
>>> to the svc_fh.
>>>
>>> The latter is *probably* cleanest.
>>> i.e. an svc_fh should always own a reference to exp->ex_path.mnt, and
>>> fh_put must put it.
>>
>> I don't agree adding new argument (eg, fh_vfsmnt) in svc_fh.
> 
> I wasn't suggesting that a new field be added to svc_fh.
> Just that if svc_fh->fh_export was not NULL, then the svc_fh "owned" a
> reference to svc_fh->fh_export->ex_path.mnt which it had to mnt_put()
> when it released ->fh_export.
> 
> So fh_put would need to change, but not much else.
> 
> It isn't the only way to handle that references - it just seemed the
> neatest as I was writing the description.  Something else might work
> better in the code.

Got it, thanks for your comments.

> 
>>
>> With it, should nfsd using fh_vfsmnt always, never using exp->ex_path.mnt
>> outside of export.c/export.h ?
>>
>> If choose fh_vfsmnt, so many codes need be updated, especially functions.
>> If exp->ex_path.mnt, the new argument fh_vfsmnt seems redundant.
>>
>> Thanks for your work.
>>
>> It reminders a new method,
>>
>> 1. There are only one outlet from each cache, exp_find_key() for expkey, 
>>    exp_get_by_name() for export.
>> 2. Any fsid to export or filehandle to export will call the function.
>> 3. exp_get()/exp_put() increase/decrease the reference of export.
>>
>> Like the fh_vfsmnt (not same), call legitimize_mntget() in the only
>> outlet function exp_find_key()/exp_get_by_name(), if fail return STALE,
>> otherwise, any valid expkey/export from the cache is validated (Have
>> get the reference of vfsmnt).
>>
>> Add mntget() in exp_get() and mntput() in exp_put(), because the export
>> passed to exp_get/exp_put are returned from exp_find_key/exp_get_by_name.
>>
>>>
>>> exp_find_key needs to legitimize ek->ek_path.mnt, so a successful
>>> return from exp_find implies an active refernece to ->ex_path.mnt.
>>> If exp_find fails, it needs to mnt_put(ek->ek_path.mnt).
>>
>> Yes, it's great.
>>
>>> All callers of exp_find need to mnt_put(exp->ex_path.mnt) when they
>>> decide not to use the exp, and must otherwise store it in an svc_fh.
>>>
>>> With this, pin_kill() should only need to wait for  exp_find_key() to
>>> discover that it cannot legitimize the mount, or for expkey_path() to
>>> replace the key via sunrpc_cache_update(), or maybe for cache_clean()
>>> to discard an old entry.
>>>
>>> Hopefully that makes it all clear.
>>
>> Yes, thanks again.
>>
>> With my method, for expkey cache,
>> 1. At first, a fsid is passed to exp_find_key, and lookup a cache
>>    in svc_expkey_lookup, if success, ekey->ek_path is pined to mount.
>> 2. Then call legitimize_mntget getting a reference of vfsmnt 
>>    before return from exp_find_key.
>> 3. Any calling exp_find_key with valid cache must put the vfsmnt.
>>
>> for export cache,
>> 1. At first, a path (returned from exp_find_key) with validate vfsmnt
>>    is passed to exp_get_by_name, if success, exp->ex_path is pined to mount.
>> 2. Then call legitimize_mntget getting a reference of vfsmnt 
>>    before return from exp_get_by_name.
> 
> I don't see any point in calling legitimise_mntget here.  exp_find_key
> already did the 'legitimize' bit so there is no need to do it again.

I just think of them as two separate logical steps.

But could the export cache contain a different vfsmnt from the one in the expkey?

thanks,
Kinglong Mee
diff mbox

Patch

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b5408f6ec..c2ef5c9d4c0d 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -47,7 +47,9 @@  void pin_kill(struct fs_pin *p)
 		spin_unlock_irq(&p->wait.lock);
 		rcu_read_unlock();
 		p->kill(p);
-		return;
+		if (p->done > 0)
+			return;
+		spin_lock_irq(&p->wait.lock);
 	}
 	if (p->done > 0) {
 		spin_unlock_irq(&p->wait.lock);