[04/23] vfs: Introduce infrastructure for revoking a file

Message ID 1243893048-17031-4-git-send-email-ebiederm@xmission.com (mailing list archive)
State Not Applicable, archived

Commit Message

Eric W. Biederman June 1, 2009, 9:50 p.m. UTC
From: Eric W. Biederman <ebiederm@xmission.com>

Introduce the file_hotplug_lock to protect file->f_op, file->private_data,
and file->f_path from revoke operations.

The file_hotplug_lock is used typically as:
error = -EIO;
if (!file_hotplug_read_trylock(file))
	goto out;
....
file_hotplug_read_unlock(file);
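
As a rough illustration of that calling pattern, here is a minimal userspace sketch.  The struct file, the single global lock_depth, and example_read() are hypothetical stand-ins rather than the kernel definitions, and the model is single-threaded, so it shows only the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct file; not the kernel definition. */
struct file {
	bool dead;	/* models FMODE_DEAD */
	int data;	/* models state reached through f_op/private_data */
};

/* Models the per-task lock depth; a single task is assumed. */
static int lock_depth;

static bool file_hotplug_read_trylock(struct file *file)
{
	if (file->dead)
		return false;
	lock_depth++;
	return true;
}

static void file_hotplug_read_unlock(struct file *file)
{
	(void)file;
	assert(lock_depth > 0);
	lock_depth--;
}

/* A read-like file operation that follows the pattern above. */
static int example_read(struct file *file, int *out)
{
	int error = -EIO;

	if (!file_hotplug_read_trylock(file))
		goto out;
	*out = file->data;	/* guarded fields are only touched here */
	error = 0;
	file_hotplug_read_unlock(file);
out:
	return error;
}
```

Once the file is marked dead, every caller written this way falls through to the -EIO return without touching the guarded fields.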

In five subsystems (sysfs, proc, sysctl, tty, and sound) we have support for
modifying a file descriptor so that the underlying object can go away.
In looking at the problem of pci hotunplug it appears that we
potentially need that support for all file descriptors except ones
talking to files on filesystems.  Even for file descriptors referring
to files, support for the underlying object going away is
interesting for implementing features like umount -f and sys_revoke.

The implementations in sysfs, proc and sysctl are all very similar and
are composed of several components.
- A reference count to track that the file operations are being used.
- An ability to flag the file as no longer being valid.
- An ability to wait until the file operations are no longer being used.

In addition for a complete solution we need:
- A reliable way to find the file structures that we need to revoke.
- To wait for but not tamper with ongoing file creation and cleanup.
- A guarantee that all waits with user space controlled duration are removed.

The file_hotplug_lock has a very unique implementation necessitated by
the need to have no performance impact on existing code.  Classic locking
primitives and reference counting cause pipeline stalls, except for rcu
which provides no ability to prevent reading a data structure while
it is being updated.

file_hotplug_lock keeps the overhead extremely low by dedicating a
small amount of space in the task_struct to store the set of files
the task is currently in the process of using.
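
A minimal userspace sketch of that bookkeeping, under stated assumptions: struct task is a hypothetical miniature of task_struct holding only the lock slots, MAX_DEPTH mirrors MAX_FILE_HOTPLUG_LOCK_DEPTH from the patch, and the model is single-threaded, so it ignores the ordering between readers and the scanning revoker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 4	/* mirrors MAX_FILE_HOTPLUG_LOCK_DEPTH */

struct file {
	bool dead;	/* models FMODE_DEAD */
};

/* Hypothetical miniature of task_struct: only the lock bookkeeping. */
struct task {
	int depth;
	struct file *held[MAX_DEPTH];
};

/* Fast path: touches only the caller's own task, no shared counter. */
static bool read_trylock(struct task *t, struct file *f)
{
	if (f->dead || t->depth >= MAX_DEPTH)
		return false;
	t->held[t->depth++] = f;
	return true;
}

static void read_unlock(struct task *t)
{
	assert(t->depth > 0);
	t->held[--t->depth] = NULL;
}

/* Slow path: scan every task's slots looking for the file. */
static bool file_in_use(const struct task *tasks, size_t ntasks,
			const struct file *f)
{
	for (size_t i = 0; i < ntasks; i++)
		for (int j = 0; j < MAX_DEPTH; j++)
			if (tasks[i].held[j] == f)
				return true;
	return false;
}
```

The trade-off is visible in the shapes of the two paths: taking and dropping the lock writes only task-local memory, while the revoker pays for a walk over every task.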

The revoke algorithm is simple:
- Find a file on the file_list.
   If it is dying or being created, come back later.
 * Take a reference to the file, ensuring it does not get freed while the
   revoke code accesses it.
 * Block out new usages of fields guarded by file_hotplug_lock.
 * Kick the underlying implementation to wake up functions that are potentially
   blocked indefinitely.
 * Wait until there are no tasks holding file_hotplug_read_lock.
 * Release the file specific data.
 * Drop the file ref count.
- Repeat until the file list is empty.
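
The steps above can be sketched as a single-threaded userspace model.  Everything here is an assumption made for illustration: the readers field stands in for the scan over tasks, the released flag stands in for removing the file from the list, and where the real code would sleep and retry the model simply reports that another pass is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct file {
	bool opened;	/* models FMODE_OPENED */
	bool dead;	/* models FMODE_DEAD */
	long count;	/* models f_count */
	int readers;	/* models tasks inside the read lock */
	bool released;	/* set once the file specific data is dropped */
};

/* Returns true if the file was revoked, false if it must be retried. */
static bool revoke_one(struct file *f)
{
	if (!f->opened || f->count == 0)
		return false;	/* being created or dying: come back later */

	f->count++;		/* keep the file alive while we work */
	f->dead = true;		/* block out new readers */

	/* The real code would kick the implementation and wait here. */
	if (f->readers > 0) {
		f->count--;
		return false;
	}

	f->released = true;	/* release the file specific data */
	assert(f->count > 0);
	f->count--;		/* drop our reference */
	return true;
}

/* One pass over the list; returns how many files need another pass. */
static size_t revoke_pass(struct file *files, size_t n)
{
	size_t pending = 0;

	for (size_t i = 0; i < n; i++)
		if (!files[i].released && !revoke_one(&files[i]))
			pending++;
	return pending;
}
```

A caller would repeat revoke_pass() until it returns zero, which corresponds to the "repeat until the file list is empty" step.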

The implication of this implementation is that all revoked files will
behave exactly the same way, except for policy controlled by flags in
fmode.  The expected behavior of a revoked file is that close succeeds and
all other operations return -EIO.  Except for read on ttys this matches the
historical BSD behavior.

Appropriate exports are present so modular character devices can
use the file_list.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 Documentation/filesystems/vfs.txt |    5 +
 fs/Kconfig                        |    4 +
 fs/file_table.c                   |  166 ++++++++++++++++++++++++++++++++++--
 fs/open.c                         |    6 ++
 include/linux/fs.h                |   25 ++++++-
 include/linux/sched.h             |    7 ++
 6 files changed, 202 insertions(+), 11 deletions(-)

Comments

Pekka Enberg June 2, 2009, 5:16 a.m. UTC | #1
Hi Eric,

On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> +#ifdef CONFIG_FILE_HOTPLUG
> +
> +static bool file_in_use(struct file *file)
> +{
> +       struct task_struct *leader, *task;
> +       bool in_use = false;
> +       int i;
> +
> +       rcu_read_lock();
> +       do_each_thread(leader, task) {
> +               for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
> +                       if (task->file_hotplug_lock[i] == file) {
> +                               in_use = true;
> +                               goto found;
> +                       }
> +               }
> +       } while_each_thread(leader, task);
> +found:
> +       rcu_read_unlock();
> +       return in_use;
> +}

This seems rather heavy-weight. If we're going to use this
infrastructure for forced unmount, I think this will be a problem.

Can't we do this in two stages: (1) mark a bit that forces
file_hotplug_read_trylock to always fail and (2) block until the last
remaining in-kernel file_hotplug_read_unlock() has executed?
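
One possible reading of that two-stage idea, sketched in userspace C11.  The per-file readers counter is an assumption, not something the patch contains, and it puts an atomic read-modify-write on the fast path, which is the kind of cost the patch is trying to avoid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct file {
	atomic_bool dead;	/* stage 1: forces trylock to fail */
	atomic_int readers;	/* hypothetical count of active readers */
};

static bool read_trylock(struct file *f)
{
	/* Announce ourselves first, then check the flag (seq_cst). */
	atomic_fetch_add(&f->readers, 1);
	if (atomic_load(&f->dead)) {
		atomic_fetch_sub(&f->readers, 1);
		return false;
	}
	return true;
}

static void read_unlock(struct file *f)
{
	int prev = atomic_fetch_sub(&f->readers, 1);

	assert(prev > 0);
	(void)prev;
}

/* Stage 1: mark the file so that every later trylock fails. */
static void revoke_begin(struct file *f)
{
	atomic_store(&f->dead, true);
}

/* Stage 2: the revoker waits until this reports true. */
static bool revoke_done(struct file *f)
{
	return atomic_load(&f->readers) == 0;
}
```

Because the reader increments before checking the flag and the revoker sets the flag before checking the count, a reader that slips past the flag is always visible to the revoker.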

                        Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 2, 2009, 6:51 a.m. UTC | #2
Pekka Enberg <penberg@cs.helsinki.fi> writes:

> Hi Eric,
>
> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> +#ifdef CONFIG_FILE_HOTPLUG
>> +
>> +static bool file_in_use(struct file *file)
>> +{
>> +       struct task_struct *leader, *task;
>> +       bool in_use = false;
>> +       int i;
>> +
>> +       rcu_read_lock();
>> +       do_each_thread(leader, task) {
>> +               for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>> +                       if (task->file_hotplug_lock[i] == file) {
>> +                               in_use = true;
>> +                               goto found;
>> +                       }
>> +               }
>> +       } while_each_thread(leader, task);
>> +found:
>> +       rcu_read_unlock();
>> +       return in_use;
>> +}
>
> This seems rather heavy-weight. If we're going to use this
> infrastructure for forced unmount, I think this will be a problem.

> Can't we two this in two stages: (1) mark a bit that forces
> file_hotplug_read_trylock to always fail and (2) block until the last
> remaining in-kernel file_hotplug_read_unlock() has executed?

Yes there is room for more optimization in the slow path.
I haven't noticed it being a problem yet so I figured I would start
with stupid and simple.

I can easily see two passes.  The first setting the flag and calling
f_op->dead.  The second some kind of consolidated walk through the task
list, allowing checking on multiple files at once.
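
A rough userspace sketch of what such a consolidated walk might look like.  The struct task miniature and the files_in_use() helper are hypothetical; nothing like this is in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 4	/* mirrors MAX_FILE_HOTPLUG_LOCK_DEPTH */

struct file {
	int id;
};

/* Hypothetical miniature of task_struct: only the lock slots. */
struct task {
	struct file *held[MAX_DEPTH];
};

/*
 * Walk the task list once and mark which of the nfiles candidates are
 * still held by some task.  Returns the number of files still in use.
 */
static size_t files_in_use(const struct task *tasks, size_t ntasks,
			   struct file *const *files, bool *in_use,
			   size_t nfiles)
{
	size_t busy = 0;

	assert(nfiles == 0 || in_use != NULL);
	for (size_t k = 0; k < nfiles; k++)
		in_use[k] = false;

	for (size_t i = 0; i < ntasks; i++) {
		for (int j = 0; j < MAX_DEPTH; j++) {
			if (!tasks[i].held[j])
				continue;
			for (size_t k = 0; k < nfiles; k++) {
				if (tasks[i].held[j] == files[k] && !in_use[k]) {
					in_use[k] = true;
					busy++;
				}
			}
		}
	}
	return busy;
}
```

With this shape the task list is walked once per batch of revoked files rather than once per file.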

I'm not ready to consider anything that will add cost to the fast
path in the file descriptors though.

Eric
Pekka Enberg June 2, 2009, 7:08 a.m. UTC | #3
Hi Eric,

On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Pekka Enberg <penberg@cs.helsinki.fi> writes:
>
>> Hi Eric,
>>
>> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> +#ifdef CONFIG_FILE_HOTPLUG
>>> +
>>> +static bool file_in_use(struct file *file)
>>> +{
>>> +       struct task_struct *leader, *task;
>>> +       bool in_use = false;
>>> +       int i;
>>> +
>>> +       rcu_read_lock();
>>> +       do_each_thread(leader, task) {
>>> +               for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>>> +                       if (task->file_hotplug_lock[i] == file) {
>>> +                               in_use = true;
>>> +                               goto found;
>>> +                       }
>>> +               }
>>> +       } while_each_thread(leader, task);
>>> +found:
>>> +       rcu_read_unlock();
>>> +       return in_use;
>>> +}
>>
>> This seems rather heavy-weight. If we're going to use this
>> infrastructure for forced unmount, I think this will be a problem.
>
>> Can't we two this in two stages: (1) mark a bit that forces
>> file_hotplug_read_trylock to always fail and (2) block until the last
>> remaining in-kernel file_hotplug_read_unlock() has executed?
>
> Yes there is room for more optimization in the slow path.
> I haven't noticed being a problem yet so I figured I would start
> with stupid and simple.

Yup, just wanted to point it out.

On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> I can easily see two passes.  The first setting the flag an calling
> f_op->dead.  The second some kind of consolidate walk through the task
> list, allowing checking on multiple files at once.
>
> I'm not ready to consider anything that will add cost to the fast
> path in the file descriptors though.

Makes sense.

                        Pekka
Nick Piggin June 2, 2009, 7:14 a.m. UTC | #4
On Mon, Jun 01, 2009 at 02:50:29PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> Introduce the file_hotplug_lock to protect file->f_op, file->private,
> file->f_path from revoke operations.
> 
> The file_hotplug_lock is used typically as:
> error = -EIO;
> if (!file_hotplug_read_trylock(file))
> 	goto out;
> ....
> file_hotplug_read_unlock(file);

Why is it called hotplug? Does it have anything to do with hardware?
Because every concurrently changed software data structure in the
kernel can be "hot"-modified, right?

Wouldn't file_revoke_lock be more appropriate?


> In addition for a complete solution we need:
> - A reliable way the file structures that we need to revoke.
> - To wait for but not tamper with ongoing file creation and cleanup.
> - A guarantee that all with user space controlled duration are removed.
> 
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code.  Classic locking

Well, it isn't no performance impact. Function calls, branches, icache
and dcache...

Linus Torvalds June 2, 2009, 5:06 p.m. UTC | #5
On Tue, 2 Jun 2009, Nick Piggin wrote:
>
> Why is it called hotplug? Does it have anything to do with hardware?
> Because every concurrently changed software data structure in the
> kernel can be "hot"-modified, right?
> 
> Wouldn't file_revoke_lock be more appropriate?

I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not 
"plug" and "unplug".

		Linus
Eric W. Biederman June 2, 2009, 8:52 p.m. UTC | #6
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 2 Jun 2009, Nick Piggin wrote:
>>
>> Why is it called hotplug? Does it have anything to do with hardware?
>> Because every concurrently changed software data structure in the
>> kernel can be "hot"-modified, right?
>> 
>> Wouldn't file_revoke_lock be more appropriate?
>
> I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not 
> "plug" and "unplug".

I guess this shows my bias in triggering this code path from pci
hotunplug instead of from some system call.

I'm not married to the name.  I wanted file_lock but that is already
used, and I did call the method revoke.

The only place where hotplug gives a useful hint is that it makes it
clear we really are disconnecting the file descriptor from what lies
below it.  We can't do some weird thing like keep the underlying object,
because the underlying object is gone.

Eric
Eric W. Biederman June 2, 2009, 10:56 p.m. UTC | #7
Nick Piggin <npiggin@suse.de> writes:

>> In addition for a complete solution we need:
>> - A reliable way the file structures that we need to revoke.
>> - To wait for but not tamper with ongoing file creation and cleanup.
>> - A guarantee that all with user space controlled duration are removed.
>> 
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code.  Classic locking
>
> Well, it isn't no performance impact. Function calls, branches, icache
> and dcache...

Practically none.

Everything I could measure was in the noise.  It is cheaper than any serializing
locking primitive.  I ran both lmbench and did some microbenchmark testing.
So I know on the fast path the overhead is minimal.  Certainly less than  what
we are doing in sysfs and proc today.

Eric

Nick Piggin June 3, 2009, 6:37 a.m. UTC | #8
On Tue, Jun 02, 2009 at 01:52:46PM -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Tue, 2 Jun 2009, Nick Piggin wrote:
> >>
> >> Why is it called hotplug? Does it have anything to do with hardware?
> >> Because every concurrently changed software data structure in the
> >> kernel can be "hot"-modified, right?
> >> 
> >> Wouldn't file_revoke_lock be more appropriate?
> >
> > I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not 
> > "plug" and "unplug".
> 
> I guess this shows my bias in triggering this code path from pci
> hotunplug.  Instead of with some system call.
> 
> I'm not married to the name.  I wanted file_lock but that is already
> used, and I did call the method revoke.

Definitely it is not going to be called hotplug in the generic
vfs layer :)

 
> The only place where hotplug gives a useful hint is that it makes it
> clear we really are disconnecting the file descriptor from what lies
> below it.

Isn't that hotUNplug?

But anyway hot plug/unplug is a purely hardware concept. Revoke
for "unplug", please, including naming of patches, changelogs,
and locks etc.


>  We can't do some weird thing like keep the underlying object.
> Because the underlying object is gone.

Nick Piggin June 3, 2009, 6:38 a.m. UTC | #9
On Tue, Jun 02, 2009 at 03:56:02PM -0700, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> >> In addition for a complete solution we need:
> >> - A reliable way the file structures that we need to revoke.
> >> - To wait for but not tamper with ongoing file creation and cleanup.
> >> - A guarantee that all with user space controlled duration are removed.
> >> 
> >> The file_hotplug_lock has a very unique implementation necessitated by
> >> the need to have no performance impact on existing code.  Classic locking
> >
> > Well, it isn't no performance impact. Function calls, branches, icache
> > and dcache...
> 
> Practically none.

OK that's different from none. There is obviously overhead.

Miklos Szeredi June 5, 2009, 9:03 a.m. UTC | #10
Hi Eric,

Very interesting work.

On Mon,  1 Jun 2009, Eric W. Biederman wrote:
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code.  Classic locking
> primitives and reference counting cause pipeline stalls, except for rcu
> which provides no ability to preventing reading a data structure while
> it is being updated.

Well, the simple solution to that is to add another level of indirection:

old:

  fdtable -> file

new:

  fdtable -> persistent_file -> file

Then it is possible to replace persistent_file->file with a revoked
one under RCU.  This has the added advantage that it supports
arbitrary file replacements, not just ones which return EIO.

Another advantage is that dereferencing can normally be done "under
the hood" in fget()/fget_light().  Only code which wants to
permanently store a file pointer (like the SCM_RIGHTS thing) would
need to be aware of the extra complexity.
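
A minimal userspace sketch of that indirection, using a C11 atomic pointer in place of RCU.  The struct layouts, revoked_file, pf_read() and pf_revoke() are hypothetical names for illustration; a kernel version would also have to defer freeing the old file until a grace period has passed, which this model does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct file {
	int (*read)(struct file *);
	int value;
};

/* The fd table would point here; this object outlives a revoke. */
struct persistent_file {
	_Atomic(struct file *) file;
};

static int live_read(struct file *f)
{
	return f->value;
}

static int revoked_read(struct file *f)
{
	(void)f;
	return -EIO;
}

/* Shared replacement whose operations all fail with -EIO. */
static struct file revoked_file = { revoked_read, 0 };

/* What fget()-style code would do "under the hood". */
static int pf_read(struct persistent_file *pf)
{
	struct file *f = atomic_load(&pf->file);

	assert(f != NULL);
	return f->read(f);
}

/* Swap in the revoked file; the caller owns the returned old file. */
static struct file *pf_revoke(struct persistent_file *pf)
{
	return atomic_exchange(&pf->file, &revoked_file);
}
```

Swapping in some other struct file instead of revoked_file would give the arbitrary replacement mentioned above.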

Would that work, do you think?

Thanks,
Miklos
Eric W. Biederman June 5, 2009, 7:06 p.m. UTC | #11
Miklos Szeredi <miklos@szeredi.hu> writes:

> Hi Eric,
>
> Very interesting work.
>
> On Mon,  1 Jun 2009, Eric W. Biederman wrote:
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code.  Classic locking
>> primitives and reference counting cause pipeline stalls, except for rcu
>> which provides no ability to preventing reading a data structure while
>> it is being updated.
>
> Well, the simple solution to that is to add another level of indirection:
>
> old:
>
>   fdtable -> file
>
> new:
>
>   fdtable -> persistent_file -> file
>
> Then it is possible to replace persistent_file->file with a revoked
> one under RCU.  This has the added advantage that it supports
> arbitrary file replacements, not just ones which return EIO.
>
> Another advantage is that dereferencing can normally be done "under
> the hood" in fget()/fget_light().  Only code which wants to
> permanently store a file pointer (like the SCM_RIGHTS thing) would
> need to be aware of the extra complexity.
>
> Would that work, do you think?

Well I went down this path for a little while, and it has some good points.
Unfortunately it appears to be more costly.

fget() and friends are semantically very different from my
file_hotplug_read_trylock and unlock.  In fact there is very little
overlap, which means that being transparent to the vfs users doesn't
actually work.

We actually have more and less predictable places where we store files.

If there was actually a compelling case for being more general I would
certainly agree that splitting the file structure in two would be a
good deal.  As it is that level of flexibility seems to be overkill.

Eric
Patch

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..d220fd5 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -806,6 +806,11 @@  otherwise noted.
   splice_read: called by the VFS to splice data from file to a pipe. This
 	       method is used by the splice(2) system call
 
+  dead: Called by the VFS to notify a file that it has been killed.
+	Typically this is used to wake up poll, read or other blocking
+	file methods, that could be indefinitely waiting for something
+	to happen.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
diff --git a/fs/Kconfig b/fs/Kconfig
index 9f7270f..2fb86b0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -265,4 +265,8 @@  endif
 source "fs/nls/Kconfig"
 source "fs/dlm/Kconfig"
 
+config FILE_HOTPLUG
+       bool
+       default n
+
 endmenu
diff --git a/fs/file_table.c b/fs/file_table.c
index 978f267..9db3031 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,6 +23,7 @@ 
 #include <linux/sysctl.h>
 #include <linux/percpu_counter.h>
 #include <linux/writeback.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 
@@ -201,7 +202,7 @@  int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
 	file->f_path.dentry = dentry;
 	file->f_path.mnt = mntget(mnt);
 	file->f_mapping = dentry->d_inode->i_mapping;
-	file->f_mode = mode;
+	file->f_mode = mode | FMODE_OPENED;
 	file->f_op = fop;
 
 	/*
@@ -252,17 +253,12 @@  void drop_file_write_access(struct file *file)
 }
 EXPORT_SYMBOL_GPL(drop_file_write_access);
 
-/* __fput is called from task context when aio completion releases the last
- * last use of a struct file *.  Do not use otherwise.
- */
-void __fput(struct file *file)
+static void frelease(struct file *file)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct inode *inode = dentry->d_inode;
 
-	might_sleep();
-
 	fsnotify_close(file);
 	/*
 	 * The function eventpoll_release() should be the first called
@@ -277,23 +273,38 @@  void __fput(struct file *file)
 	}
 	if (file->f_op && file->f_op->release)
 		file->f_op->release(inode, file);
-	security_file_free(file);
 	ima_file_free(file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
 		cdev_put(inode->i_cdev);
 	fops_put(file->f_op);
-	put_pid(file->f_owner.pid);
 	if (!special_file(inode->i_mode))
 		file_list_del(file, &inode->i_files);
 	if (file->f_mode & FMODE_WRITE)
 		drop_file_write_access(file);
 	file->f_path.dentry = NULL;
 	file->f_path.mnt = NULL;
-	file_free(file);
+	file->f_mapping = NULL;
+	file->f_op = NULL;
+	file->private_data = NULL;
 	dput(dentry);
 	mntput(mnt);
 }
 
+/* __fput is called from task context when aio completion releases the last
+ * last use of a struct file *.  Do not use otherwise.
+ */
+void __fput(struct file *file)
+{
+	might_sleep();
+
+	if (likely(!(file->f_mode & FMODE_DEAD)))
+		frelease(file);
+
+	security_file_free(file);
+	put_pid(file->f_owner.pid);
+	file_free(file);
+}
+
 struct file *fget(unsigned int fd)
 {
 	struct file *file;
@@ -360,6 +371,7 @@  void init_file_list(struct file_list *files)
 	INIT_LIST_HEAD(&files->list);
 	spin_lock_init(&files->lock);
 }
+EXPORT_SYMBOL(init_file_list);
 
 void file_list_add(struct file *file, struct file_list *files)
 {
@@ -377,6 +389,140 @@  void file_list_del(struct file *file, struct file_list *files)
 }
 EXPORT_SYMBOL(file_list_del);
 
+#ifdef CONFIG_FILE_HOTPLUG
+
+static bool file_in_use(struct file *file)
+{
+	struct task_struct *leader, *task;
+	bool in_use = false;
+	int i;
+
+	rcu_read_lock();
+	do_each_thread(leader, task) {
+		for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
+			if (task->file_hotplug_lock[i] == file) {
+				in_use = true;
+				goto found;
+			}
+		}
+	} while_each_thread(leader, task);
+found:
+	rcu_read_unlock();
+	return in_use;
+}
+
+static int revoke_file(struct file *file)
+{
+	/* Must be called with f_count held and FMODE_OPENED set */
+	fmode_t mode;
+
+	if (!(file->f_mode  & FMODE_REVOKE))
+		return -EINVAL;
+
+	/*
+	 * Tell everyone this file is dead.
+	 */
+	spin_lock(&file->f_ep_lock);
+	mode = file->f_mode;
+	file->f_mode |= FMODE_DEAD;
+	spin_unlock(&file->f_ep_lock);
+	if (mode & FMODE_DEAD)
+		return -EIO;
+
+	/*
+	 * Notify the file we have killed it.
+	 */
+	if (file->f_op->dead)
+		file->f_op->dead(file);
+
+	/*
+	 * Wait until there are no more callers in the file operations.
+	 */
+	if (file_in_use(file)) {
+		do {
+			schedule_timeout_uninterruptible(1);
+		} while (file_in_use(file));
+	}
+
+	revoke_file_mappings(file);
+	frelease(file);
+
+	return 0;
+}
+
+int revoke_file_list(struct file_list *files)
+{
+	struct file *file;
+	int error = 0;
+	int empty;
+
+restart:
+	file_list_lock(files);
+	list_for_each_entry(file, &files->list, f_u.fu_list) {
+
+		/* Don't touch files that have not yet been fully opened */
+		if (!(file->f_mode & FMODE_OPENED))
+			continue;
+
+		/* Ensure I am looking at the file after it was opened */
+		smp_rmb();
+
+		/* Don't touch files that are in the final stages of being closed. */
+		if (file_count(file) == 0)
+			continue;
+
+		/* Get a reference to the file */
+		if (!atomic_long_inc_not_zero(&file->f_count))
+			continue;
+
+		file_list_unlock(files);
+
+		error = revoke_file(file);
+		fput(file);
+
+		if (unlikely(error))
+			goto out;
+		goto restart;
+	}
+	empty = list_empty(&files->list);
+	file_list_unlock(files);
+	/*
+	 * If the file list had files we can't touch sleep a little while
+	 * and check again.
+	 */
+	if (!empty) {
+		schedule_timeout_uninterruptible(1);
+		goto restart;
+	}
+out:
+	return error;
+}
+EXPORT_SYMBOL(revoke_file_list);
+
+int __lockfunc file_hotplug_read_trylock(struct file *file)
+{
+	fmode_t mode = file->f_mode;
+	int locked = 0;
+	if (!(mode & FMODE_DEAD)) {
+		struct task_struct *tsk = current;
+		int pos = tsk->file_hotplug_lock_depth;
+		if (likely(pos < MAX_FILE_HOTPLUG_LOCK_DEPTH)) {
+			tsk->file_hotplug_lock_depth = pos + 1;
+			tsk->file_hotplug_lock[pos] = file;
+			locked = 1;
+		}
+	}
+	return locked;
+}
+
+void __lockfunc file_hotplug_read_unlock(struct file *file)
+{
+	struct task_struct *tsk = current;
+	tsk->file_hotplug_lock[--(tsk->file_hotplug_lock_depth)] = NULL;
+}
+
+#endif /* CONFIG_FILE_HOTPLUG */
+
 int fs_may_remount_ro(struct super_block *sb)
 {
 	struct inode *inode;
diff --git a/fs/open.c b/fs/open.c
index 20c3fc0..d0b2433 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -809,6 +809,7 @@  static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
 					const struct cred *cred)
 {
 	struct inode *inode;
+	fmode_t opened_fmode;
 	int error;
 
 	f->f_flags = flags;
@@ -857,6 +858,11 @@  static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
 		}
 	}
 
+	opened_fmode = f->f_mode | FMODE_OPENED;
+	/* Ensure revoke_file_list sees the opened file */
+	smp_wmb();
+	f->f_mode = opened_fmode;
+
 	return f;
 
 cleanup_all:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5329fd6..f7f4c46 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,13 @@  struct inodes_stat_t {
  */
 #define FMODE_NOCMTIME		((__force fmode_t)2048)
 
+/* File has successfully been opened */
+#define FMODE_OPENED		((__force fmode_t)4096)
+/* File supports being revoked */
+#define FMODE_REVOKE		((__force fmode_t)8192)
+/* File is dead (has been revoked) */
+#define FMODE_DEAD		((__force fmode_t)16384)
+
 /*
  * The below are the various read and write types that we support. Some of
  * them include behavioral modifiers that send information down to the
@@ -903,6 +910,7 @@  static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 #define FILE_MNT_WRITE_RELEASED	2
 
 struct file {
+	/* file_hotplug_lock f_op, private, f_path, f_mapping */
 	/*
 	 * fu_list becomes invalid after file_free is called and queued via
 	 * fu_rcuhead for RCU freeing
@@ -935,12 +943,26 @@  struct file {
 	/* Used by fs/eventpoll.c to link all the hooks to this file */
 	struct list_head	f_ep_links;
 #endif /* #ifdef CONFIG_EPOLL */
-	struct address_space	*f_mapping;
+	struct address_space	*f_mapping; /* file_hotplug_lock or mmap_sem */
 #ifdef CONFIG_DEBUG_WRITECOUNT
 	unsigned long f_mnt_write_state;
 #endif
 };
 
+#ifdef CONFIG_FILE_HOTPLUG
+extern int file_hotplug_read_trylock(struct file *file);
+extern void file_hotplug_read_unlock(struct file *file);
+extern int revoke_file_list(struct file_list *files);
+#else
+static inline int file_hotplug_read_trylock(struct file *file)
+{
+	return 1;
+}
+static inline void file_hotplug_read_unlock(struct file *file)
+{
+}
+#endif
+
 static inline void file_list_lock(struct file_list *files)
 {
 	spin_lock(&files->lock);
@@ -1514,6 +1536,7 @@  struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	void (*dead)(struct file *);
 };
 
 struct inode_operations {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..bbf1616 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1302,6 +1302,13 @@  struct task_struct {
 	struct irqaction *irqaction;
 #endif
 
+/* File hotplug lock */
+#ifdef CONFIG_FILE_HOTPLUG
+#define MAX_FILE_HOTPLUG_LOCK_DEPTH 4U
+	int file_hotplug_lock_depth;
+	struct file *file_hotplug_lock[MAX_FILE_HOTPLUG_LOCK_DEPTH];
+#endif
+
 	/* Protection of the PI data structures: */
 	spinlock_t pi_lock;