Message ID | 150791017031.39579.15556617763195969964.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
> While implementing MAP_DIRECT, an mmap flag that arranges for an
> FL_LAYOUT lease to be established, Al noted:
>
>     You are not even guaranteed that descriptor will remain be still
>     open by the time you pass it down to your helper, nevermind the
>     moment when event actually happens...
>
> The first problem can be solved with an fd{get,put} at mmap
> {entry,exit}.

Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
it does is guarantee that struct file won't get closed under you, which
is nowhere near the same thing.  And while we are at it, it certainly
_is_ called by mmap()...

> The second problem appears to be a general issue.
>
> Leases follow the lifetime of the inode, so it is possible for a lease
> to be broken after the file is closed. When that happens userspace may
> get a notification on a stale fd. Of course it is not recommended that a
> process close a file descriptor with an active lease, but if it does we
> should assume that the notification is not needed either. Walk leases at
> close time and invalidate any pending fasync instances.

What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
triggering close-on-exec, etc...  Besides, you are changing a user-visible
behaviour here.  Suppose your process forks and the child closes all
descriptors; should that stop SIGIO delivery to the parent?

Let's step back for a minute; could you describe how the userland is supposed
to use that thing?
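[Editor's illustration, not part of the original thread.] The distinction Al draws between a descriptor and the struct file behind it can be observed from userspace. The sketch below is only a rough analogy under stated assumptions: the helper names `description_survives_close()` and `child_close_leaves_parent_open()` are hypothetical, `/dev/null` is assumed to exist, and the process is assumed to be single-threaded. The first helper shows that closing one descriptor does not close the open file while a dup() still references it; the second shows that a forked child closing its copy leaves the parent's descriptor usable, which is the scenario behind the SIGIO question.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the open file outlives close() of one of its descriptors. */
int description_survives_close(void)
{
	int fd = open("/dev/null", O_WRONLY);
	if (fd < 0)
		return -1;

	int alias = dup(fd);		/* second descriptor, same open file */
	if (alias < 0) {
		close(fd);
		return -1;
	}

	close(fd);			/* the descriptor number is gone ... */

	/* ... so using the old number fails with EBADF ... */
	int old_fails = (write(fd, "x", 1) == -1 && errno == EBADF);
	/* ... yet the file itself is still open through the alias. */
	int alias_works = (write(alias, "x", 1) == 1);

	close(alias);
	return old_fails && alias_works;
}

/* Returns 1 if a child's close() leaves the parent's descriptor usable. */
int child_close_leaves_parent_open(void)
{
	int fd = open("/dev/null", O_WRONLY);
	if (fd < 0)
		return -1;

	pid_t pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		close(fd);		/* child drops its inherited copy */
		_exit(0);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0) {
		close(fd);
		return -1;
	}

	int ok = (write(fd, "x", 1) == 1);	/* parent is unaffected */
	close(fd);
	return ok;
}
```

Both helpers return 1 when the behaviour described above holds and -1 if a setup call fails. A hook that acts on a single close(2) therefore sees only one of possibly many references to the same open file.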
On Fri, Oct 13, 2017 at 10:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
>> While implementing MAP_DIRECT, an mmap flag that arranges for an
>> FL_LAYOUT lease to be established, Al noted:
>>
>>     You are not even guaranteed that descriptor will remain be still
>>     open by the time you pass it down to your helper, nevermind the
>>     moment when event actually happens...
>>
>> The first problem can be solved with an fd{get,put} at mmap
>> {entry,exit}.
>
> Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
> it does is guarantee that struct file won't get closed under you, which
> is nowhere near the same thing.  And while we are at it, it certainly
> _is_ called by mmap()...
>
>> The second problem appears to be a general issue.
>>
>> Leases follow the lifetime of the inode, so it is possible for a lease
>> to be broken after the file is closed. When that happens userspace may
>> get a notification on a stale fd. Of course it is not recommended that a
>> process close a file descriptor with an active lease, but if it does we
>> should assume that the notification is not needed either. Walk leases at
>> close time and invalidate any pending fasync instances.
>
> What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
> triggering close-on-exec, etc...  Besides, you are changing a user-visible
> behaviour here.  Suppose your process forks and the child closes all
> descriptors; should that stop SIGIO delivery to the parent?
>
> Let's step back for a minute; could you describe how the userland is supposed
> to use that thing?

MAP_DIRECT is meant as a way to safely pass DAX mappings of a file
to the RDMA sub-system, or any sub-system that follows a memory
registration design pattern. RDMA expects that once it has done
get_user_pages() it has exclusive access to the memory backing the
file mapping indefinitely.

With page cache backed file mappings we can truncate and hole punch the
file at will and the RDMA operations will continue to pages that are no
longer part of the file. Yes, that breaks coherency, but it otherwise
does not cause damage to unrelated file blocks. With DAX we do not have
the luxury of an indirect page for the RDMA to land on; the operations
are going straight to file blocks in persistent memory.

With MAP_DIRECT the proposal is that when the RDMA memory registration
code sees 'vma_is_dax(vma) == true' it calls a new ->lease_direct()
vm_operation to take an FL_LAYOUT lease against the file to protect
against truncate / fallocate. Lease expiration triggers a callback to
redirect or shut down RDMA. The filesystem mmap implementation also
arranges for an FL_LAYOUT lease to be taken at mmap time, when the fd
is available, to set up a SIGIO notification.

If we don't take a lease at mmap time then we would need to develop a
notification mechanism that is specific to the RDMA code, and using
SIGIO on the mmap fd seemed a more generic solution to me.
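[Editor's illustration, not part of the original thread.] MAP_DIRECT and FL_LAYOUT leases are not available to userspace, so they cannot be shown directly. As a rough analogy for "a lease on an fd plus a signal when it breaks", the sketch below uses the existing fcntl(2) lease interface (F_SETLEASE with F_SETSIG). The helper name `arm_lease_notification()`, the handler, and the choice of SIGRTMIN are assumptions made for illustration. Taking a lease can fail depending on the filesystem and on file ownership, so callers should treat -1 as a normal outcome.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for F_SETLEASE, F_GETLEASE and F_SETSIG */
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Descriptor number reported by the most recent lease-break signal. */
static volatile sig_atomic_t last_signalled_fd = -1;

static void on_lease_break(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* With F_SETSIG and SA_SIGINFO, si_fd names the leased descriptor. */
	last_signalled_fd = info->si_fd;
}

/*
 * Hypothetical helper: take a write lease on fd and request SIGRTMIN
 * (rather than the default SIGIO) when another opener breaks the lease.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int arm_lease_notification(int fd)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_lease_break;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGRTMIN, &sa, NULL) < 0)
		return -1;

	/* Route the notification for this fd to SIGRTMIN with siginfo. */
	if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0)
		return -1;

	/* A write lease needs the file to have no other openers. */
	return fcntl(fd, F_SETLEASE, F_WRLCK) < 0 ? -1 : 0;
}
```

The handler receives only a descriptor number in si_fd. If that number has been closed or reused by the time the signal arrives, the handler cannot tell, which is the stale-fd concern discussed in this thread.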
On Fri, Oct 13, 2017 at 10:43 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Oct 13, 2017 at 10:01 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Oct 13, 2017 at 08:56:10AM -0700, Dan Williams wrote:
>>> While implementing MAP_DIRECT, an mmap flag that arranges for an
>>> FL_LAYOUT lease to be established, Al noted:
>>>
>>>     You are not even guaranteed that descriptor will remain be still
>>>     open by the time you pass it down to your helper, nevermind the
>>>     moment when event actually happens...
>>>
>>> The first problem can be solved with an fd{get,put} at mmap
>>> {entry,exit}.
>>
>> Huh?  fdget() does *NOT* guarantee that descriptor won't get closed.  What
>> it does is guarantee that struct file won't get closed under you, which
>> is nowhere near the same thing.  And while we are at it, it certainly
>> _is_ called by mmap()...
>>
>>> The second problem appears to be a general issue.
>>>
>>> Leases follow the lifetime of the inode, so it is possible for a lease
>>> to be broken after the file is closed. When that happens userspace may
>>> get a notification on a stale fd. Of course it is not recommended that a
>>> process close a file descriptor with an active lease, but if it does we
>>> should assume that the notification is not needed either. Walk leases at
>>> close time and invalidate any pending fasync instances.
>>
>> What the hell is special about close(2) and not, e.g. dup2(2)?  Or execve(2)
>> triggering close-on-exec, etc...  Besides, you are changing a user-visible
>> behaviour here.  Suppose your process forks and the child closes all
>> descriptors; should that stop SIGIO delivery to the parent?
>>
>> Let's step back for a minute; could you describe how the userland is supposed
>> to use that thing?
>
> MAP_DIRECT is meant as a way to safely pass DAX mappings of a file
> to the RDMA sub-system...

Al, before you spend any more time thinking about this let me close
with the RDMA folks on a notification scheme that works for them.
On Fri, 2017-10-13 at 08:56 -0700, Dan Williams wrote:
> While implementing MAP_DIRECT, an mmap flag that arranges for an
> FL_LAYOUT lease to be established, Al noted:
>
>     You are not even guaranteed that descriptor will remain be still
>     open by the time you pass it down to your helper, nevermind the
>     moment when event actually happens...
>
> The first problem can be solved with an fd{get,put} at mmap
> {entry,exit}. The second problem appears to be a general issue.
>
> Leases follow the lifetime of the inode, so it is possible for a lease
> to be broken after the file is closed. When that happens userspace may
> get a notification on a stale fd. Of course it is not recommended that a
> process close a file descriptor with an active lease, but if it does we
> should assume that the notification is not needed either. Walk leases at
> close time and invalidate any pending fasync instances.
>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/fcntl.c         |   24 +++++++++++++++++++++++-
>  fs/file.c          |    1 +
>  fs/locks.c         |   22 ++++++++++++++++++++++
>  include/linux/fs.h |    7 +++++++
>  4 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 448a1119f0be..03612c363b90 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -974,6 +974,28 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
>
>  EXPORT_SYMBOL(fasync_helper);
>
> +static void __fasync_silence(int fd, struct fasync_struct *fa)
> +{
> +	while (fa) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&fa->fa_lock, flags);
> +		if (fa->fa_file && fa->fa_fd == fd)
> +			fa->fa_fd = -1;
> +		spin_unlock_irqrestore(&fa->fa_lock, flags);
> +		fa = rcu_dereference(fa->fa_next);
> +	}
> +}
> +
> +void fasync_silence(int fd, struct fasync_struct **fp)
> +{
> +	if (*fp) {
> +		rcu_read_lock();
> +		__fasync_silence(fd, *fp);
> +		rcu_read_unlock();
> +	}
> +}
> +
>  /*
>   * rcu_read_lock() is held
>   */
> @@ -989,7 +1011,7 @@ static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
>  			return;
>  		}
>  		spin_lock_irqsave(&fa->fa_lock, flags);
> -		if (fa->fa_file) {
> +		if (fa->fa_file && fa->fa_fd >= 0) {
>  			fown = &fa->fa_file->f_owner;
>  			/* Don't send SIGURG to processes which have not set a
>  			   queued signum: SIGURG has its own default signalling
> diff --git a/fs/file.c b/fs/file.c
> index 1fc7fbbb4510..b90969bf1f94 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -633,6 +633,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
>  	rcu_assign_pointer(fdt->fd[fd], NULL);
>  	__clear_close_on_exec(fd, fdt);
>  	__put_unused_fd(files, fd);
> +	locks_silence_lease(fd, file);
>  	spin_unlock(&files->file_lock);
>  	return filp_close(file, files);
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 1bd71c4d663a..ca93e4dbdd90 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2573,6 +2573,28 @@ void locks_remove_file(struct file *filp)
>  	spin_unlock(&ctx->flc_lock);
>  }
>
> +/*
> + * The fd is assumed to valid, i.e. this routine is called in the
> + * filp_close() path where the state of fd is known.
> + */
> +void locks_silence_lease(int fd, struct file *filp)
> +{
> +	struct file_lock_context *ctx;
> +	struct file_lock *fl;
> +
> +	ctx = smp_load_acquire(&locks_inode(filp)->i_flctx);
> +	if (!ctx || list_empty_careful(&ctx->flc_lease))
> +		return;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(fl, &ctx->flc_lease, fl_list) {
> +		if (fl->fl_pid != current->tgid)
> +			continue;
> +		fasync_silence(fd, &fl->fl_fasync);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +}
> +
>  /**
>   * posix_unblock_lock - stop waiting for a file lock
>   * @waiter: the lock which was waiting
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 17e0e899e184..019853a7b2cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1076,6 +1076,7 @@ extern void locks_copy_lock(struct file_lock *, struct file_lock *);
>  extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
>  extern void locks_remove_posix(struct file *, fl_owner_t);
>  extern void locks_remove_file(struct file *);
> +extern void locks_silence_lease(int, struct file *);
>  extern void locks_release_private(struct file_lock *);
>  extern void posix_test_lock(struct file *, struct file_lock *);
>  extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
> @@ -1153,6 +1154,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
>  	return;
>  }
>
> +static inline void locks_silence_lease(int fd, struct file *filp)
> +{
> +	return;
> +}
> +
>  static inline void locks_remove_file(struct file *filp)
>  {
>  	return;
> @@ -1260,6 +1266,7 @@ extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasy
>  extern int fasync_remove_entry(struct file *, struct fasync_struct **);
>  extern struct fasync_struct *fasync_alloc(void);
>  extern void fasync_free(struct fasync_struct *);
> +extern void fasync_silence(int, struct fasync_struct **);
>
>  /* can be called from interrupts */
>  extern void kill_fasync(struct fasync_struct **, int, int);

All remaining file leases associated with a particular struct file
should be released at last fput via locks_remove_lease.

So yes, you might get the SIGIO after you've already closed the file if
there are lingering filp references out there. You might even get the
signal after you've already recycled the fd. That's potentially a real
problem, IMO (and I guess it's the one you're wanting to address).

FL_LAYOUT leases are not exposed to userland right now, so I think we
can change the semantics there, as long as it doesn't break knfsd's use
of them (and I wouldn't think that it would).

FL_LEASE is more debatable, just because there are userland callers out
there (as Al points out).

Ok, just saw the note about waiting until you talk to the RDMA folks.
Let us know if you want to look at this more closely.
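[Editor's illustration, not part of the original thread.] The "recycled fd" hazard Jeff mentions follows from the POSIX rule that open(2) returns the lowest-numbered unused descriptor. A minimal sketch, assuming a single-threaded process and that `/dev/null` and `/dev/zero` exist; the helper name `fd_number_is_recycled()` is hypothetical:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a descriptor number is handed out again for an unrelated
 * file right after being closed, -1 if a setup call fails.
 */
int fd_number_is_recycled(void)
{
	/* Stand-in for the descriptor a lease notification would name. */
	int leased = open("/dev/null", O_RDONLY);
	if (leased < 0)
		return -1;
	close(leased);

	/* An unrelated open now receives the lowest free number. */
	int unrelated = open("/dev/zero", O_RDONLY);
	if (unrelated < 0)
		return -1;

	int recycled = (unrelated == leased);
	close(unrelated);
	return recycled;
}
```

A signal that arrives late and carries the old number in si_fd would then appear to refer to the unrelated file.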
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 448a1119f0be..03612c363b90 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -974,6 +974,28 @@ int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fap
 
 EXPORT_SYMBOL(fasync_helper);
 
+static void __fasync_silence(int fd, struct fasync_struct *fa)
+{
+	while (fa) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&fa->fa_lock, flags);
+		if (fa->fa_file && fa->fa_fd == fd)
+			fa->fa_fd = -1;
+		spin_unlock_irqrestore(&fa->fa_lock, flags);
+		fa = rcu_dereference(fa->fa_next);
+	}
+}
+
+void fasync_silence(int fd, struct fasync_struct **fp)
+{
+	if (*fp) {
+		rcu_read_lock();
+		__fasync_silence(fd, *fp);
+		rcu_read_unlock();
+	}
+}
+
 /*
  * rcu_read_lock() is held
  */
@@ -989,7 +1011,7 @@ static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
 			return;
 		}
 		spin_lock_irqsave(&fa->fa_lock, flags);
-		if (fa->fa_file) {
+		if (fa->fa_file && fa->fa_fd >= 0) {
 			fown = &fa->fa_file->f_owner;
 			/* Don't send SIGURG to processes which have not set a
 			   queued signum: SIGURG has its own default signalling
diff --git a/fs/file.c b/fs/file.c
index 1fc7fbbb4510..b90969bf1f94 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -633,6 +633,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__clear_close_on_exec(fd, fdt);
 	__put_unused_fd(files, fd);
+	locks_silence_lease(fd, file);
 	spin_unlock(&files->file_lock);
 	return filp_close(file, files);
 
diff --git a/fs/locks.c b/fs/locks.c
index 1bd71c4d663a..ca93e4dbdd90 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2573,6 +2573,28 @@ void locks_remove_file(struct file *filp)
 	spin_unlock(&ctx->flc_lock);
 }
 
+/*
+ * The fd is assumed to valid, i.e. this routine is called in the
+ * filp_close() path where the state of fd is known.
+ */
+void locks_silence_lease(int fd, struct file *filp)
+{
+	struct file_lock_context *ctx;
+	struct file_lock *fl;
+
+	ctx = smp_load_acquire(&locks_inode(filp)->i_flctx);
+	if (!ctx || list_empty_careful(&ctx->flc_lease))
+		return;
+
+	spin_lock(&ctx->flc_lock);
+	list_for_each_entry(fl, &ctx->flc_lease, fl_list) {
+		if (fl->fl_pid != current->tgid)
+			continue;
+		fasync_silence(fd, &fl->fl_fasync);
+	}
+	spin_unlock(&ctx->flc_lock);
+}
+
 /**
  * posix_unblock_lock - stop waiting for a file lock
  * @waiter: the lock which was waiting
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..019853a7b2cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1076,6 +1076,7 @@ extern void locks_copy_lock(struct file_lock *, struct file_lock *);
 extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
 extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_file(struct file *);
+extern void locks_silence_lease(int, struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
@@ -1153,6 +1154,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline void locks_silence_lease(int fd, struct file *filp)
+{
+	return;
+}
+
 static inline void locks_remove_file(struct file *filp)
 {
 	return;
@@ -1260,6 +1266,7 @@ extern struct fasync_struct *fasync_insert_entry(int, struct file *, struct fasy
 extern int fasync_remove_entry(struct file *, struct fasync_struct **);
 extern struct fasync_struct *fasync_alloc(void);
 extern void fasync_free(struct fasync_struct *);
+extern void fasync_silence(int, struct fasync_struct **);
 
 /* can be called from interrupts */
 extern void kill_fasync(struct fasync_struct **, int, int);
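[Editor's illustration, not part of the original patch.] The core of the change is small: at close time, walk the fasync list and mark entries registered for that fd by setting fa_fd to -1, then have the delivery path skip entries with a negative fa_fd. The userspace model below is a simplified sketch of that logic only. `struct fasync_model` and both functions are hypothetical stand-ins; the kernel's spinlocks, RCU traversal, the fl_pid filter in locks_silence_lease() and actual signal delivery are deliberately left out.

```c
#include <stddef.h>

/* Simplified stand-in for struct fasync_struct; no locking or RCU. */
struct fasync_model {
	int fa_fd;			/* descriptor the entry was registered on */
	int fa_file_valid;		/* models fa->fa_file != NULL */
	struct fasync_model *fa_next;
};

/* Models __fasync_silence(): mark every live entry for fd as silenced. */
void model_fasync_silence(int fd, struct fasync_model *fa)
{
	for (; fa; fa = fa->fa_next)
		if (fa->fa_file_valid && fa->fa_fd == fd)
			fa->fa_fd = -1;
}

/*
 * Models the patched test in kill_fasync_rcu(): count the entries that
 * would still be signalled, skipping those silenced at close time.
 */
int model_kill_fasync(const struct fasync_model *fa)
{
	int delivered = 0;

	for (; fa; fa = fa->fa_next)
		if (fa->fa_file_valid && fa->fa_fd >= 0)
			delivered++;
	return delivered;
}
```

With three entries registered on fds 3, 4 and 3, model_kill_fasync() counts three deliveries before model_fasync_silence(3, ...) and one afterwards. The model matches entries by descriptor number alone, which is why the review asks about other ways a descriptor can go away, such as dup2(2) or close-on-exec.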
While implementing MAP_DIRECT, an mmap flag that arranges for an
FL_LAYOUT lease to be established, Al noted:

    You are not even guaranteed that descriptor will remain be still
    open by the time you pass it down to your helper, nevermind the
    moment when event actually happens...

The first problem can be solved with an fd{get,put} at mmap
{entry,exit}. The second problem appears to be a general issue.

Leases follow the lifetime of the inode, so it is possible for a lease
to be broken after the file is closed. When that happens userspace may
get a notification on a stale fd. Of course it is not recommended that a
process close a file descriptor with an active lease, but if it does we
should assume that the notification is not needed either. Walk leases at
close time and invalidate any pending fasync instances.

Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/fcntl.c         |   24 +++++++++++++++++++++++-
 fs/file.c          |    1 +
 fs/locks.c         |   22 ++++++++++++++++++++++
 include/linux/fs.h |    7 +++++++
 4 files changed, 53 insertions(+), 1 deletion(-)