[RFC] vfs: shutdown lease notifications on file close

Message ID 150791017031.39579.15556617763195969964.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams Oct. 13, 2017, 3:56 p.m. UTC
While implementing MAP_DIRECT, an mmap flag that arranges for an
FL_LAYOUT lease to be established, Al noted:

    You are not even guaranteed that descriptor will remain be still
    open by the time you pass it down to your helper, nevermind the
    moment when event actually happens...

The first problem can be solved with an fd{get,put} at mmap
{entry,exit}. The second problem appears to be a general issue.

Leases follow the lifetime of the inode, so it is possible for a lease
to be broken after the file is closed. When that happens userspace may
get a notification on a stale fd. Of course it is not recommended that a
process close a file descriptor with an active lease, but if it does we
should assume that the notification is not needed either. Walk leases at
close time and invalidate any pending fasync instances.

Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/fcntl.c         |   24 +++++++++++++++++++++++-
 fs/file.c          |    1 +
 fs/locks.c         |   22 ++++++++++++++++++++++
 include/linux/fs.h |    7 +++++++
 4 files changed, 53 insertions(+), 1 deletion(-)

Comments

Al Viro Oct. 13, 2017, 5:01 p.m. UTC | #1
On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
> While implementing MAP_DIRECT, an mmap flag that arranges for an
> FL_LAYOUT lease to be established, Al noted:
> 
>     You are not even guaranteed that descriptor will remain be still
>     open by the time you pass it down to your helper, nevermind the
>     moment when event actually happens...
> 
> The first problem can be solved with an fd{get,put} at mmap
> {entry,exit}.

Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
it does is guarantee that struct file won't get closed under you, which
is nowhere near the same thing.  And while we are at it, it certainly
_is_ called by mmap()...
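
The descriptor / struct file distinction drawn above has a rough userspace analogy, sketched below. It is only an analogy: dup() stands in for holding a reference on the open file description, and the function name is made up for this illustration. It closes the original descriptor number and checks that the number becomes invalid while the underlying pipe stays usable through the remaining reference.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Userspace analogy only, not kernel code: dup() holds a second reference
 * on the open file description, loosely like a pinned struct file.
 * Returns 1 if the description outlives close() of the original
 * descriptor number, 0 if not, -1 if the pipe could not be set up.
 */
int description_outlives_descriptor(void)
{
	int p[2];
	int ref, closed_fd, ok;
	char c = 0;

	if (pipe(p) < 0)
		return -1;

	ref = dup(p[1]);	/* second reference to the write end */
	if (ref < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	closed_fd = p[1];
	close(closed_fd);	/* the descriptor number is gone ... */

	/* ... so using the old number must fail with EBADF ... */
	ok = write(closed_fd, "y", 1) == -1 && errno == EBADF;

	/* ... while the description is still alive via the other reference. */
	ok = ok && write(ref, "x", 1) == 1;
	ok = ok && read(p[0], &c, 1) == 1 && c == 'x';

	close(ref);
	close(p[0]);
	return ok ? 1 : 0;
}
```

In a single-threaded caller the closed number stays unused for the duration of the check; in a multi-threaded program another thread could reuse it at any time, which is the hazard under discussion.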

> The second problem appears to be a general issue.
> 
> Leases follow the lifetime of the inode, so it is possible for a lease
> to be broken after the file is closed. When that happens userspace may
> get a notification on a stale fd. Of course it is not recommended that a
> process close a file descriptor with an active lease, but if it does we
> should assume that the notification is not needed either. Walk leases at
> close time and invalidate any pending fasync instances.

What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
triggering close-on-exec, etc...  Besides, you are changing a user-visible
behaviour here.  Suppose your process forks and the child closes all
descriptors; should that stop SIGIO delivery to the parent?

Let's step back for a minute; could you describe how the userland is supposed
to use that thing?
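
The fork question above can be tried from userspace with the existing FL_LEASE interface. The sketch below is a minimal experiment, assuming a Linux system where /tmp supports leases and the caller owns the file; the helper name and temp-file path are made up for this illustration. It takes a read lease, forks a child that only closes its copy of the descriptor, and then asks the kernel whether the parent still holds the lease.

```c
#define _GNU_SOURCE	/* F_SETLEASE / F_GETLEASE are Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef F_SETLEASE	/* Linux ABI values, in case _GNU_SOURCE came too late */
#define F_SETLEASE 1024
#define F_GETLEASE 1025
#endif

/*
 * Returns 1 if the parent's lease is still in place after a forked child
 * closed its copy of the descriptor, 0 if the lease is gone, and -1 if
 * the experiment could not be set up (e.g. leases unsupported here).
 */
int lease_survives_child_close(void)
{
	char path[] = "/tmp/lease-demo-XXXXXX";
	int result = -1;
	pid_t pid;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);	/* a read lease needs the file to have no writers */

	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto out_unlink;
	if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
		goto out_close;

	pid = fork();
	if (pid < 0)
		goto out_unlease;
	if (pid == 0) {
		close(fd);	/* the child drops only its own descriptor */
		_exit(0);
	}
	if (waitpid(pid, NULL, 0) != pid)
		goto out_unlease;

	/* The open file description is shared, so ask about its lease. */
	result = fcntl(fd, F_GETLEASE) == F_RDLCK ? 1 : 0;

out_unlease:
	fcntl(fd, F_SETLEASE, F_UNLCK);
out_close:
	close(fd);
out_unlink:
	unlink(path);
	return result;
}
```

With current semantics the lease belongs to the open file description, so the child's close() is expected to leave it in place; a close-time hook keyed on the descriptor number has to decide what to do in exactly this case.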

Dan Williams Oct. 13, 2017, 5:43 p.m. UTC | #2
On Fri, Oct 13, 2017 at 10:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
>> While implementing MAP_DIRECT, an mmap flag that arranges for an
>> FL_LAYOUT lease to be established, Al noted:
>>
>>     You are not even guaranteed that descriptor will remain be still
>>     open by the time you pass it down to your helper, nevermind the
>>     moment when event actually happens...
>>
>> The first problem can be solved with an fd{get,put} at mmap
>> {entry,exit}.
>
> Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
> it does is guarantee that struct file won't get closed under you, which
> is nowhere near the same thing.  And while we are at it, it certainly
> _is_ called by mmap()...
>
>> The second problem appears to be a general issue.
>>
>> Leases follow the lifetime of the inode, so it is possible for a lease
>> to be broken after the file is closed. When that happens userspace may
>> get a notification on a stale fd. Of course it is not recommended that a
>> process close a file descriptor with an active lease, but if it does we
>> should assume that the notification is not needed either. Walk leases at
>> close time and invalidate any pending fasync instances.
>
> What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
> triggering close-on-exec, etc...  Besides, you are changing a user-visible
> behaviour here.  Suppose your process forks and the child closes all
> descriptors; should that stop SIGIO delivery to the parent?
>
> Let's step back for a minute; could you describe how the userland is supposed
> to use that thing?

MAP_DIRECT is meant as a way to safely pass DAX mappings of a file
to the RDMA sub-system, or any sub-system that follows a memory
registration design pattern. RDMA expects that once it has done
get_user_pages() it has exclusive access to the memory backing
the file mapping indefinitely. With page cache backed file mappings we
can truncate and hole punch the file at will and the RDMA operations
will continue to pages that are no longer part of the file. Yes, that
breaks coherency, but it otherwise does not cause damage to unrelated
file blocks. With DAX we do not have the luxury of an indirect page
for the RDMA to land on; the operations go straight to file blocks
in persistent memory.

With MAP_DIRECT the proposal is that when the RDMA memory registration
code sees 'vma_is_dax(vma) == true' it calls a new ->lease_direct()
vm_operation to take an FL_LAYOUT lease against the file to protect
against truncate / fallocate. Lease expiration triggers a callback to
redirect or shut down RDMA. The filesystem mmap implementation also
arranges for an FL_LAYOUT lease to be taken at mmap time, when the fd
is available, to set up a SIGIO notification.

If we don't take a lease at mmap time then we would need to develop a
notification mechanism that is specific to the RDMA code, and using
SIGIO on the mmap fd seemed a more generic solution to me.

Dan Williams Oct. 13, 2017, 6 p.m. UTC | #3
On Fri, Oct 13, 2017 at 10:43 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Oct 13, 2017 at 10:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
>>> While implementing MAP_DIRECT, an mmap flag that arranges for an
>>> FL_LAYOUT lease to be established, Al noted:
>>>
>>>     You are not even guaranteed that descriptor will remain be still
>>>     open by the time you pass it down to your helper, nevermind the
>>>     moment when event actually happens...
>>>
>>> The first problem can be solved with an fd{get,put} at mmap
>>> {entry,exit}.
>>
>> Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
>> it does is guarantee that struct file won't get closed under you, which
>> is nowhere near the same thing.  And while we are at it, it certainly
>> _is_ called by mmap()...
>>
>>> The second problem appears to be a general issue.
>>>
>>> Leases follow the lifetime of the inode, so it is possible for a lease
>>> to be broken after the file is closed. When that happens userspace may
>>> get a notification on a stale fd. Of course it is not recommended that a
>>> process close a file descriptor with an active lease, but if it does we
>>> should assume that the notification is not needed either. Walk leases at
>>> close time and invalidate any pending fasync instances.
>>
>> What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
>> triggering close-on-exec, etc...  Besides, you are changing a user-visible
>> behaviour here.  Suppose your process forks and the child closes all
>> descriptors; should that stop SIGIO delivery to the parent?
>>
>> Let's step back for a minute; could you describe how the userland is supposed
>> to use that thing?
>
> MAP_DIRECT is meant as a way to safely pass DAX mappings of a file
> to the RDMA sub-system...

Al, before you spend any more time thinking about this let me close
with the RDMA folks on a notification scheme that works for them.

Jeff Layton Oct. 13, 2017, 6:30 p.m. UTC | #4
On Fri, 2017-10-13 at 08:56 -0700, Dan Williams wrote:
> While implementing MAP_DIRECT, an mmap flag that arranges for an
> FL_LAYOUT lease to be established, Al noted:
> 
>     You are not even guaranteed that descriptor will remain be still
>     open by the time you pass it down to your helper, nevermind the
>     moment when event actually happens...
> 
> The first problem can be solved with an fd{get,put} at mmap
> {entry,exit}. The second problem appears to be a general issue.
> 
> Leases follow the lifetime of the inode, so it is possible for a lease
> to be broken after the file is closed. When that happens userspace may
> get a notification on a stale fd. Of course it is not recommended that a
> process close a file descriptor with an active lease, but if it does we
> should assume that the notification is not needed either. Walk leases at
> close time and invalidate any pending fasync instances.
> 
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/fcntl.c         |   24 +++++++++++++++++++++++-
>  fs/file.c          |    1 +
>  fs/locks.c         |   22 ++++++++++++++++++++++
>  include/linux/fs.h |    7 +++++++
>  4 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 448a1119f0be..03612c363b90 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -974,6 +974,28 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
>  
>  EXPORT_SYMBOL(fasync_helper);
>  
> +static void __fasync_silence(int fd, struct fasync_struct *fa)
> +{
> +	while (fa) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&fa->fa_lock, flags);
> +		if (fa->fa_file && fa->fa_fd == fd)
> +			fa->fa_fd = -1;
> +		spin_unlock_irqrestore(&fa->fa_lock, flags);
> +		fa = rcu_dereference(fa->fa_next);
> +	}
> +}
> +
> +void fasync_silence(int fd, struct fasync_struct **fp)
> +{
> +	if (*fp) {
> +		rcu_read_lock();
> +		__fasync_silence(fd, *fp);
> +		rcu_read_unlock();
> +	}
> +}
> +
>  /*
>   * rcu_read_lock() is held
>   */
> @@ -989,7 +1011,7 @@ static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
>  			return;
>  		}
>  		spin_lock_irqsave(&fa->fa_lock, flags);
> -		if (fa->fa_file) {
> +		if (fa->fa_file && fa->fa_fd >= 0) {
>  			fown = &fa->fa_file->f_owner;
>  			/* Don't send SIGURG to processes which have not set a
>  			   queued signum: SIGURG has its own default signalling
> diff --git a/fs/file.c b/fs/file.c
> index 1fc7fbbb4510..b90969bf1f94 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -633,6 +633,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
>  	rcu_assign_pointer(fdt->fd[fd], NULL);
>  	__clear_close_on_exec(fd, fdt);
>  	__put_unused_fd(files, fd);
> +	locks_silence_lease(fd, file);
>  	spin_unlock(&files->file_lock);
>  	return filp_close(file, files);
>  
> diff --git a/fs/locks.c b/fs/locks.c
> index 1bd71c4d663a..ca93e4dbdd90 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2573,6 +2573,28 @@ void locks_remove_file(struct file *filp)
>  	spin_unlock(&ctx->flc_lock);
>  }
>  
> +/*
> + * The fd is assumed to be valid, i.e. this routine is called in the
> + * filp_close() path where the state of fd is known.
> + */
> +void locks_silence_lease(int fd, struct file *filp)
> +{
> +	struct file_lock_context *ctx;
> +	struct file_lock *fl;
> +
> +	ctx = smp_load_acquire(&locks_inode(filp)->i_flctx);
> +	if (!ctx || list_empty_careful(&ctx->flc_lease))
> +		return;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(fl, &ctx->flc_lease, fl_list) {
> +		if (fl->fl_pid != current->tgid)
> +			continue;
> +		fasync_silence(fd, &fl->fl_fasync);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +}
> +
>  /**
>   *	posix_unblock_lock - stop waiting for a file lock
>   *	@waiter: the lock which was waiting
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 17e0e899e184..019853a7b2cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1076,6 +1076,7 @@ extern void locks_copy_lock(struct file_lock *, struct file_lock *);
>  extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
>  extern void locks_remove_posix(struct file *, fl_owner_t);
>  extern void locks_remove_file(struct file *);
> +extern void locks_silence_lease(int, struct file *);
>  extern void locks_release_private(struct file_lock *);
>  extern void posix_test_lock(struct file *, struct file_lock *);
>  extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
> @@ -1153,6 +1154,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
>  	return;
>  }
>  
> +static inline void locks_silence_lease(int fd, struct file *filp)
> +{
> +	return;
> +}
> +
>  static inline void locks_remove_file(struct file *filp)
>  {
>  	return;
> @@ -1260,6 +1266,7 @@ extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasy
>  extern int fasync_remove_entry(struct file *, struct fasync_struct **);
>  extern struct fasync_struct *fasync_alloc(void);
>  extern void fasync_free(struct fasync_struct *);
> +extern void fasync_silence(int, struct fasync_struct **);
>  
>  /* can be called from interrupts */
>  extern void kill_fasync(struct fasync_struct **, int, int);


All remaining file leases associated with a particular struct file should
be released at last fput via locks_remove_lease.

So yes, you might get the SIGIO after you've already closed the file if
there are lingering filp references out there. You might even get the
signal after you've already recycled the fd. That's potentially a
real problem, IMO (and I guess it's the one you're wanting to address).
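
The fd recycling mentioned above is easy to reproduce from userspace, because a new descriptor always gets the lowest unused number. The sketch below is a minimal illustration with a made-up function name, assuming a single-threaded caller and that /dev/null and /dev/zero exist; it shows the same number coming back for an unrelated file right after close().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a descriptor number is handed out again for a different
 * file right after close(), 0 if not, -1 if a file could not be opened.
 * Assumes no other thread opens or closes descriptors concurrently.
 */
int fd_number_is_recycled(void)
{
	int first, second;

	first = open("/dev/null", O_RDONLY);
	if (first < 0)
		return -1;
	close(first);

	/* The lowest free number is handed out, which is the one just closed. */
	second = open("/dev/zero", O_RDONLY);
	if (second < 0)
		return -1;
	close(second);

	return second == first ? 1 : 0;
}
```

A notification that still carries the old number after this point would name a descriptor that now refers to /dev/zero rather than to the leased file.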

FL_LAYOUT leases are not exposed to userland right now, so I think we
can change the semantics there, as long as it doesn't break knfsd's use
of them (and I wouldn't think that it would). FL_LEASE is more
debatable, just because there are userland callers out there (as Al
points out).

Ok, just saw the note about waiting until you talk to the RDMA folks.
Let us know if you want to look at this more closely.

Patch

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 448a1119f0be..03612c363b90 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -974,6 +974,28 @@  int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
+static void __fasync_silence(int fd, struct fasync_struct *fa)
+{
+	while (fa) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&fa->fa_lock, flags);
+		if (fa->fa_file && fa->fa_fd == fd)
+			fa->fa_fd = -1;
+		spin_unlock_irqrestore(&fa->fa_lock, flags);
+		fa = rcu_dereference(fa->fa_next);
+	}
+}
+
+void fasync_silence(int fd, struct fasync_struct **fp)
+{
+	if (*fp) {
+		rcu_read_lock();
+		__fasync_silence(fd, *fp);
+		rcu_read_unlock();
+	}
+}
+
 /*
  * rcu_read_lock() is held
  */
@@ -989,7 +1011,7 @@  static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 			return;
 		}
 		spin_lock_irqsave(&fa->fa_lock, flags);
-		if (fa->fa_file) {
+		if (fa->fa_file && fa->fa_fd >= 0) {
 			fown = &fa->fa_file->f_owner;
 			/* Don't send SIGURG to processes which have not set a
 			   queued signum: SIGURG has its own default signalling
diff --git a/fs/file.c b/fs/file.c
index 1fc7fbbb4510..b90969bf1f94 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -633,6 +633,7 @@  int __close_fd(struct files_struct *files, unsigned fd)
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__clear_close_on_exec(fd, fdt);
 	__put_unused_fd(files, fd);
+	locks_silence_lease(fd, file);
 	spin_unlock(&files->file_lock);
 	return filp_close(file, files);
 
diff --git a/fs/locks.c b/fs/locks.c
index 1bd71c4d663a..ca93e4dbdd90 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2573,6 +2573,28 @@  void locks_remove_file(struct file *filp)
 	spin_unlock(&ctx->flc_lock);
 }
 
+/*
+ * The fd is assumed to be valid, i.e. this routine is called in the
+ * filp_close() path where the state of fd is known.
+ */
+void locks_silence_lease(int fd, struct file *filp)
+{
+	struct file_lock_context *ctx;
+	struct file_lock *fl;
+
+	ctx = smp_load_acquire(&locks_inode(filp)->i_flctx);
+	if (!ctx || list_empty_careful(&ctx->flc_lease))
+		return;
+
+	spin_lock(&ctx->flc_lock);
+	list_for_each_entry(fl, &ctx->flc_lease, fl_list) {
+		if (fl->fl_pid != current->tgid)
+			continue;
+		fasync_silence(fd, &fl->fl_fasync);
+	}
+	spin_unlock(&ctx->flc_lock);
+}
+
 /**
  *	posix_unblock_lock - stop waiting for a file lock
  *	@waiter: the lock which was waiting
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..019853a7b2cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1076,6 +1076,7 @@  extern void locks_copy_lock(struct file_lock *, struct file_lock *);
 extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
 extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_file(struct file *);
+extern void locks_silence_lease(int, struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
@@ -1153,6 +1154,11 @@  static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline void locks_silence_lease(int fd, struct file *filp)
+{
+	return;
+}
+
 static inline void locks_remove_file(struct file *filp)
 {
 	return;
@@ -1260,6 +1266,7 @@  extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasy
 extern int fasync_remove_entry(struct file *, struct fasync_struct **);
 extern struct fasync_struct *fasync_alloc(void);
 extern void fasync_free(struct fasync_struct *);
+extern void fasync_silence(int, struct fasync_struct **);
 
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
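
The silencing scheme in the patch above can be modeled in plain userspace C to make the control flow easier to follow. This is a simplified sketch, not the kernel code: the struct and function names are stand-ins invented for the illustration, `has_file` replaces the `fa_file` pointer check, and all locking and RCU handling is omitted.

```c
#include <stddef.h>

/* Simplified stand-in for struct fasync_struct; no locking, no RCU. */
struct fa_entry {
	int fa_fd;		/* descriptor to report, -1 once silenced */
	int has_file;		/* stand-in for fa_file != NULL */
	struct fa_entry *next;
};

/* Mirrors __fasync_silence(): mark every live entry for @fd as silenced. */
void model_fasync_silence(int fd, struct fa_entry *fa)
{
	for (; fa; fa = fa->next) {
		if (fa->has_file && fa->fa_fd == fd)
			fa->fa_fd = -1;
	}
}

/*
 * Mirrors the patched check in kill_fasync_rcu(): only entries that still
 * have a file and a non-negative fd are notified. Returns how many
 * notifications would have been sent.
 */
int model_kill_fasync(const struct fa_entry *fa)
{
	int sent = 0;

	for (; fa; fa = fa->next) {
		if (fa->has_file && fa->fa_fd >= 0)
			sent++;
	}
	return sent;
}
```

Note that the match is on the descriptor number alone, so the model, like the patch, only covers the close path it is called from; the other ways a descriptor can go away that were raised in review are not handled.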