diff mbox

[5/5] nfsd: allows user un-mounting filesystem where nfsd exports base on

Message ID 5561E9FA.4050808@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kinglong Mee May 24, 2015, 3:10 p.m. UTC
If there are some mount points(not exported for nfs) under pseudo root,
after client's operation of those entry under the root, anyone *can't*
unmount those mount points until export cache expired.

/nfs/xfs        *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs       *(rw,insecure,no_subtree_check,no_root_squash)
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root  6 Apr 20 22:01 xfs
Filesystem                      1K-blocks    Used Available Use% Mounted on
......
/dev/sdd                          1038336   32944   1005392   4% /nfs/pnfs
/dev/sdc                         10475520   32928  10442592   1% /nfs/xfs
/dev/sde                           999320    1284    929224   1% /nfs/test
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
umount: /nfs/test/: target is busy
        (In some cases useful info about processes that
	use the device is found by lsof(8) or fuser(1).)

I don't think that's user expect, they want umount /nfs/test/.

It's caused by exports cache of nfsd holds the reference of
the path (here is /nfs/test/), so, it can't be umounted.

v2,
1. Update exports according to the "allow_umount" option.
   Pin to vfsmnt default, change when updating.
2. Using kzalloc for all memory allocating without kmalloc.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
---
 fs/nfsd/export.c                 | 52 ++++++++++++++++++++++++++++++++--------
 fs/nfsd/export.h                 | 11 ++++++++-
 include/uapi/linux/nfsd/export.h |  3 ++-
 3 files changed, 54 insertions(+), 12 deletions(-)

Comments

Al Viro June 5, 2015, 3:02 p.m. UTC | #1
On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
>  
>  	if (test_bit(CACHE_VALID, &key->h.flags) &&
>  	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
> -		path_put(&key->ek_path);
> +		path_put_unpin(&key->ek_path, &key->ek_pin);
>  	auth_domain_put(key->ek_client);
> -	kfree(key);
> +	kfree_rcu(key, rcu_head);
>  }

That looks wrong.  OK, so you want umount() to proceed; fine, no problem
with that.  However, what happens if the final mntput() hits while you
are just approaching that path_put_unpin()?  ->kill() will be triggered,
and it would bloody better
	a) make sure that expkey_put() is called for that key if it hadn't
already been done and
	b) do not return until such expkey_put() completes.  Including the
ones that might have been already entered by the time we'd got to ->kill().

Am I missing something subtle here?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro June 6, 2015, 2:21 a.m. UTC | #2
On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote:
> On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
> > --- a/fs/nfsd/export.c
> > +++ b/fs/nfsd/export.c
> > @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
> >  
> >  	if (test_bit(CACHE_VALID, &key->h.flags) &&
> >  	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
> > -		path_put(&key->ek_path);
> > +		path_put_unpin(&key->ek_path, &key->ek_pin);
> >  	auth_domain_put(key->ek_client);
> > -	kfree(key);
> > +	kfree_rcu(key, rcu_head);
> >  }
> 
> That looks wrong.  OK, so you want umount() to proceed; fine, no problem
> with that.  However, what happens if the final mntput() hits while you
> are just approaching that path_put_unpin()?  ->kill() will be triggered,
> and it would bloody better
> 	a) make sure that expkey_put() is called for that key if it hadn't
> already been done and
> 	b) do not return until such expkey_put() completes.  Including the
> ones that might have been already entered by the time we'd got to ->kill().
> 
> Am I missing something subtle here?

Having looked through that code...  It *is* wrong.  Note that the normal
approach is to have pin_remove() called via pin_kill(), directly or triggered
from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to.
pin_remove() should never be called outside of ->kill() callbacks.  It should
be called at the point where you are OK with fs being shut down.

The fundamental reason why it's broken is different, though - you *can't*
grab a reference if all you've got is a pin.  By the time the callback is
called, the mount in question is already irretrievably committed to being
killed.  There's one hell of a wide window between the point of no return
and the point where you are notified of anything, and that's by design -
you might very well have had several mounts doomed by a syscall and they
all get through cleanup_mnt() just before return to userland.  One by one.
So between the point where this puppy is doomed and the call of your callback
there might have been several filesystems going through shutdown, with tons
of IO, waiting for remote servers, etc.

We could add a primitive that would _try_ to grab a reference - that can
be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT,
fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing
unlock_mount_hash() on both exit paths).  HOWEVER, you'll need to think
very carefully where to use that primitive - unlike mntget() it _can_
fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong
if used heavily.

Could you give details on lifecycle of those objects, including the stages
at which we might try to grab references?  Combination of such primitive with
a pin (doing just "NULL the references to vfsmount/dentry, do dput() on
what that dentry used to be and call pin_remove()") might work, if the
lifecycle is good enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kinglong Mee June 6, 2015, 1:38 p.m. UTC | #3
On 6/6/2015 10:21 AM, Al Viro wrote:
> On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote:
>> On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
>>> --- a/fs/nfsd/export.c
>>> +++ b/fs/nfsd/export.c
>>> @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
>>>  
>>>  	if (test_bit(CACHE_VALID, &key->h.flags) &&
>>>  	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
>>> -		path_put(&key->ek_path);
>>> +		path_put_unpin(&key->ek_path, &key->ek_pin);
>>>  	auth_domain_put(key->ek_client);
>>> -	kfree(key);
>>> +	kfree_rcu(key, rcu_head);
>>>  }
>>
>> That looks wrong.  OK, so you want umount() to proceed; fine, no problem
>> with that.  However, what happens if the final mntput() hits while you
>> are just approaching that path_put_unpin()?  ->kill() will be triggered,
>> and it would bloody better
>> 	a) make sure that expkey_put() is called for that key if it hadn't
>> already been done and
>> 	b) do not return until such expkey_put() completes.  Including the
>> ones that might have been already entered by the time we'd got to ->kill().

You are right.
Sorry for my fault, the above patch misses caring the race.

>>
>> Am I missing something subtle here?
> 
> Having looked through that code...  It *is* wrong.  Note that the normal
> approach is to have pin_remove() called via pin_kill(), directly or triggered
> from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to.
> pin_remove() should never be called outside of ->kill() callbacks.  It should
> be called at the point where you are OK with fs being shut down.

Thank you very much for your comments.
I will try to using fs_pin as the restrict.

> 
> The fundamental reason why it's broken is different, though - you *can't*
> grab a reference if all you've got is a pin.  By the time the callback is
> called, the mount in question is already irretrievably committed to being
> killed.  There's one hell of a wide window between the point of no return
> and the point where you are notified of anything, and that's by design -
> you might very well have had several mounts doomed by a syscall and they
> all get through cleanup_mnt() just before return to userland.  One by one.
> So between the point where this puppy is doomed and the call of your callback
> there might have been several filesystems going through shutdown, with tons
> of IO, waiting for remote servers, etc.
> 
> We could add a primitive that would _try_ to grab a reference - that can
> be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT,
> fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing
> unlock_mount_hash() on both exit paths).  HOWEVER, you'll need to think
> very carefully where to use that primitive - unlike mntget() it _can_
> fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong
> if used heavily.

Do you mean adding a new feature?

> 
> Could you give details on lifecycle of those objects, including the stages
> at which we might try to grab references?  Combination of such primitive with
> a pin (doing just "NULL the references to vfsmount/dentry, do dput() on
> what that dentry used to be and call pin_remove()") might work, if the
> lifecycle is good enough.

NFSD has two caches named expkey and export which are managed by sunrpc cache
fundamental. I will only explain export following for expkey is similar as export.

struct cache_head {
	struct kref     ref;
	... ... 
};
struct svc_export {
	struct cache_head       h;
	struct path             ex_path;
	... ...
};

1. svc_export has a reference, will be freed when the reference is decreased to zero.
2. ex_path must be put when freed (Want change mntget to fs_pin for ex_path's vfsmnt).
3. With fs_pin, there are two logic (one is the normal logic, the other is pin_kill)
   which can cause free svc_export.
4. The reference of the normal logic is zero, but the pin_kill logic is not zero.
   the second logic will decrease the reference indirectly, if decrease to zero, 
   umount will go though the normal logic's code, at last frees the svc_export;
   if not zero, umount must don't free the svc_export.

I try to solve the window as,
struct svc_export {
	struct cache_head       h;
	struct path             ex_path;
	... ...
	struct fs_pin           ex_pin;
        struct rcu_head         rcu_head;

        /* For cache_put and fs umounting window */
        struct completion       ex_done;
        struct work_struct      ex_work;
};

1. ex_done is for umount waiting the reference is decreased to zero.
2. ex_work is for umount decrease the reference indirectly.
3. The normal logic don't free the svc_export, calls complete() and
   go though pin_kill() logic as,
   (svc_export_put will be called when reference is decreased to zero)
   
   static void svc_export_put(struct kref *ref)
   {
          struct svc_export *exp = container_of(ref, struct svc_export, h.ref);

          rcu_read_lock();
          complete(&exp->ex_done);
          pin_kill(&exp->ex_pin);
   }

4. pin_kill() logic will schedules to decrease the reference though ex_work,
   and at last path_put_unpin and destroy the svc_export.

   static void export_pin_kill(struct fs_pin *pin)
   {
          struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);

          if (!completion_done(&exp->ex_done)) {
                  schedule_work(&exp->ex_work);
                  wait_for_completion(&exp->ex_done);
          }
          path_put_unpin(&exp->ex_path, &exp->ex_pin);
          svc_export_destroy(exp);
   }

The full patches will be sent later. Thanks again.

thanks,
Kinglong Mee
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index f79521a..cc34b0b 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -43,9 +43,9 @@  static void expkey_put(struct kref *ref)
 
 	if (test_bit(CACHE_VALID, &key->h.flags) &&
 	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
-		path_put(&key->ek_path);
+		path_put_unpin(&key->ek_path, &key->ek_pin);
 	auth_domain_put(key->ek_client);
-	kfree(key);
+	kfree_rcu(key, rcu_head);
 }
 
 static void expkey_request(struct cache_detail *cd,
@@ -83,7 +83,7 @@  static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 		return -EINVAL;
 	mesg[mlen-1] = 0;
 
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	err = -ENOMEM;
 	if (!buf)
 		goto out;
@@ -120,6 +120,7 @@  static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 		goto out;
 
 	key.ek_client = dom;	
+	key.cd = cd;
 	key.ek_fsidtype = fsidtype;
 	memcpy(key.ek_fsid, buf, len);
 
@@ -210,6 +211,13 @@  static inline void expkey_init(struct cache_head *cnew,
 	new->ek_fsidtype = item->ek_fsidtype;
 
 	memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid));
+	new->cd = item->cd;
+}
+
+static void expkey_pin_kill(struct fs_pin *pin)
+{
+	struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin);
+	cache_force_expire(key->cd, &key->h);
 }
 
 static inline void expkey_update(struct cache_head *cnew,
@@ -218,13 +226,14 @@  static inline void expkey_update(struct cache_head *cnew,
 	struct svc_expkey *new = container_of(cnew, struct svc_expkey, h);
 	struct svc_expkey *item = container_of(citem, struct svc_expkey, h);
 
+	init_fs_pin(&new->ek_pin, expkey_pin_kill);
 	new->ek_path = item->ek_path;
-	path_get(&item->ek_path);
+	path_get_pin(&new->ek_path, &new->ek_pin);
 }
 
 static struct cache_head *expkey_alloc(void)
 {
-	struct svc_expkey *i = kmalloc(sizeof(*i), GFP_KERNEL);
+	struct svc_expkey *i = kzalloc(sizeof(*i), GFP_KERNEL);
 	if (i)
 		return &i->h;
 	else
@@ -309,11 +318,16 @@  static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
 static void svc_export_put(struct kref *ref)
 {
 	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
-	path_put(&exp->ex_path);
+
+	if (EX_ALLOW_UMOUNT(exp))
+		path_put_unpin(&exp->ex_path, &exp->ex_pin);
+	else
+		path_put(&exp->ex_path);
+
 	auth_domain_put(exp->ex_client);
 	nfsd4_fslocs_free(&exp->ex_fslocs);
 	kfree(exp->ex_uuid);
-	kfree(exp);
+	kfree_rcu(exp, rcu_head);
 }
 
 static void svc_export_request(struct cache_detail *cd,
@@ -520,7 +534,7 @@  static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
 		return -EINVAL;
 	mesg[mlen-1] = 0;
 
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
 
@@ -694,15 +708,23 @@  static int svc_export_match(struct cache_head *a, struct cache_head *b)
 		path_equal(&orig->ex_path, &new->ex_path);
 }
 
+static void export_pin_kill(struct fs_pin *pin)
+{
+	struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
+	cache_force_expire(exp->cd, &exp->h);
+}
+
 static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
 {
 	struct svc_export *new = container_of(cnew, struct svc_export, h);
 	struct svc_export *item = container_of(citem, struct svc_export, h);
 
+	init_fs_pin(&new->ex_pin, export_pin_kill);
 	kref_get(&item->ex_client->ref);
+	new->ex_flags = NFSEXP_ALLOW_UMOUNT;
 	new->ex_client = item->ex_client;
 	new->ex_path = item->ex_path;
-	path_get(&item->ex_path);
+	path_get_pin(&new->ex_path, &new->ex_pin);
 	new->ex_fslocs.locations = NULL;
 	new->ex_fslocs.locations_count = 0;
 	new->ex_fslocs.migrated = 0;
@@ -717,6 +739,14 @@  static void export_update(struct cache_head *cnew, struct cache_head *citem)
 	struct svc_export *item = container_of(citem, struct svc_export, h);
 	int i;
 
+	if (!EX_ALLOW_UMOUNT(item)) {
+		path_get(&new->ex_path);
+		if (EX_ALLOW_UMOUNT(new))
+			path_put_unpin(&new->ex_path, &new->ex_pin);
+		else
+			path_put(&new->ex_path);
+	}
+
 	new->ex_flags = item->ex_flags;
 	new->ex_anon_uid = item->ex_anon_uid;
 	new->ex_anon_gid = item->ex_anon_gid;
@@ -740,7 +770,7 @@  static void export_update(struct cache_head *cnew, struct cache_head *citem)
 
 static struct cache_head *svc_export_alloc(void)
 {
-	struct svc_export *i = kmalloc(sizeof(*i), GFP_KERNEL);
+	struct svc_export *i = kzalloc(sizeof(*i), GFP_KERNEL);
 	if (i)
 		return &i->h;
 	else
@@ -811,6 +841,7 @@  exp_find_key(struct cache_detail *cd, struct auth_domain *clp, int fsid_type,
 
 	key.ek_client = clp;
 	key.ek_fsidtype = fsid_type;
+	key.cd = cd;
 	memcpy(key.ek_fsid, fsidv, key_len(fsid_type));
 
 	ek = svc_expkey_lookup(cd, &key);
@@ -1159,6 +1190,7 @@  static struct flags {
 	{ NFSEXP_NOAUTHNLM, {"insecure_locks", ""}},
 	{ NFSEXP_V4ROOT, {"v4root", ""}},
 	{ NFSEXP_PNFS, {"pnfs", ""}},
+	{ NFSEXP_ALLOW_UMOUNT, {"allow_umount", ""}},
 	{ 0, {"", ""}}
 };
 
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 1f52bfc..1134875 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -4,6 +4,7 @@ 
 #ifndef NFSD_EXPORT_H
 #define NFSD_EXPORT_H
 
+#include <linux/fs_pin.h>
 #include <linux/sunrpc/cache.h>
 #include <uapi/linux/nfsd/export.h>
 
@@ -46,6 +47,8 @@  struct exp_flavor_info {
 
 struct svc_export {
 	struct cache_head	h;
+	struct cache_detail	*cd;
+
 	struct auth_domain *	ex_client;
 	int			ex_flags;
 	struct path		ex_path;
@@ -58,7 +61,9 @@  struct svc_export {
 	struct exp_flavor_info	ex_flavors[MAX_SECINFO_LIST];
 	enum pnfs_layouttype	ex_layout_type;
 	struct nfsd4_deviceid_map *ex_devid_map;
-	struct cache_detail	*cd;
+
+	struct fs_pin		ex_pin;
+	struct rcu_head		rcu_head;
 };
 
 /* an "export key" (expkey) maps a filehandlefragement to an
@@ -67,17 +72,21 @@  struct svc_export {
  */
 struct svc_expkey {
 	struct cache_head	h;
+	struct cache_detail	*cd;
 
 	struct auth_domain *	ek_client;
 	int			ek_fsidtype;
 	u32			ek_fsid[6];
 
 	struct path		ek_path;
+	struct fs_pin		ek_pin;
+	struct rcu_head		rcu_head;
 };
 
 #define EX_ISSYNC(exp)		(!((exp)->ex_flags & NFSEXP_ASYNC))
 #define EX_NOHIDE(exp)		((exp)->ex_flags & NFSEXP_NOHIDE)
 #define EX_WGATHER(exp)		((exp)->ex_flags & NFSEXP_GATHERED_WRITES)
+#define EX_ALLOW_UMOUNT(exp)	((exp)->ex_flags & NFSEXP_ALLOW_UMOUNT)
 
 int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp);
 __be32 check_nfsd_access(struct svc_export *exp, struct svc_rqst *rqstp);
diff --git a/include/uapi/linux/nfsd/export.h b/include/uapi/linux/nfsd/export.h
index 0df7bd5..61aa8bb 100644
--- a/include/uapi/linux/nfsd/export.h
+++ b/include/uapi/linux/nfsd/export.h
@@ -51,9 +51,10 @@ 
  */
 #define	NFSEXP_V4ROOT		0x10000
 #define NFSEXP_PNFS		0x20000
+#define NFSEXP_ALLOW_UMOUNT	0x40000
 
 /* All flags that we claim to support.  (Note we don't support NOACL.) */
-#define NFSEXP_ALLFLAGS		0x3FE7F
+#define NFSEXP_ALLFLAGS		0x7FE7F
 
 /* The flags that may vary depending on security flavor: */
 #define NFSEXP_SECINFO_FLAGS	(NFSEXP_READONLY | NFSEXP_ROOTSQUASH \