diff mbox

[18/23] vfs: Teach epoll to use file_hotplug_lock

Message ID 1243893048-17031-18-git-send-email-ebiederm@xmission.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Eric W. Biederman June 1, 2009, 9:50 p.m. UTC
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 fs/eventpoll.c |   39 ++++++++++++++++++++++++++++++++-------
 1 files changed, 32 insertions(+), 7 deletions(-)

Comments

Davide Libenzi June 2, 2009, 4:51 p.m. UTC | #1
On Mon, 1 Jun 2009, Eric W. Biederman wrote:

> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
> 
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
>  fs/eventpoll.c |   39 ++++++++++++++++++++++++++++++++-------
>  1 files changed, 32 insertions(+), 7 deletions(-)

This patchset gives me the willies for the amount of changes and possible 
impact on many subsystems.
Without having looked at the details, are you aware that epoll does not 
act like poll/select, and fds are not automatically removed (as in, 
dequeued from the poll wait queue) in any foreseeable amount of time after 
a POLLERR is received?
As far as the usespace API goes, they have the right to remain there.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 2, 2009, 9:23 p.m. UTC | #2
Davide Libenzi <davidel@xmailserver.org> writes:

> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>
>> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> ---
>>  fs/eventpoll.c |   39 ++++++++++++++++++++++++++++++++-------
>>  1 files changed, 32 insertions(+), 7 deletions(-)
>
> This patchset gives me the willies for the amount of changes and possible 
> impact on many subsystems.

It both is and is not that bad.  It is the cost of adding a lock.
For the VFS except for nfsd the I have touched everything that needs to be
touched.

Other subsystems that open read/write close files should be able to use
existing vfs helpers so they don't need to know about the new locking
explicitly.

Actually taking advantage of this infrastructure in a subsystem is
comparatively easy.  It took me about an hour to get uio using it.
That part is not deep by any means and is opt in.

> Without having looked at the details, are you aware that epoll does not 
> act like poll/select, and fds are not automatically removed (as in, 
> dequeued from the poll wait queue) in any foreseeable amount of time after 
> a POLLERR is received?

Yes I am aware of how epoll acts differently.

> As far as the usespace API goes, they have the right to remain there.

I absolutely agree.

Currently I have the code acting like close() with respect to epoll and
just having the file descriptor vanish at the end of the revoke.  While
we the revoke is in progress you get an EIO.

The file descriptor is not freed by a revoke operation so you can happily
hang unto it as long as you want.

I thought of doing something more uniform to user space.  But I observed
that the existing epoll punts on the case of a file descriptor being closed
and locking to go from a file to the other epoll datastructures is pretty
horrid I said forget it and used the existing close behaviour.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 2, 2009, 9:52 p.m. UTC | #3
On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> Davide Libenzi <davidel@xmailserver.org> writes:
> 
> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> >
> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
> >> 
> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> >> ---
> >>  fs/eventpoll.c |   39 ++++++++++++++++++++++++++++++++-------
> >>  1 files changed, 32 insertions(+), 7 deletions(-)
> >
> > This patchset gives me the willies for the amount of changes and possible 
> > impact on many subsystems.
> 
> It both is and is not that bad.  It is the cost of adding a lock.

We both know that it is not only the cost of a lock, but also the 
sprinkling over a pretty vast amount of subsystems, of another layer of 
code.



> I thought of doing something more uniform to user space.  But I observed
> that the existing epoll punts on the case of a file descriptor being closed
> and locking to go from a file to the other epoll datastructures is pretty
> horrid I said forget it and used the existing close behaviour.

Well, you cannot rely on the caller to tidy up the epoll fd by issuing an 
epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave 
lingering crap around. You cannot even hold a reference of the file, since 
otherwise the epoll hooking will have to trigger not only at ->release() 
time, but at every close, where you'll have to figure out if this is the 
last real userspace reference or not. Plus all the issues related to 
holding permanent extra references to userspace files.
And since a file can be added in many epoll devices, you need to 
unregister it from all of them (hence the other datastructures lookup). 
Better this, on the slow path, with locks acquired only in the epoll usage 
case, than some other thing and on the fast path, for every file.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 2, 2009, 10:51 p.m. UTC | #4
Davide Libenzi <davidel@xmailserver.org> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> Davide Libenzi <davidel@xmailserver.org> writes:
>> 
>> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> >
>> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>> >> 
>> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> >> ---
>> >>  fs/eventpoll.c |   39 ++++++++++++++++++++++++++++++++-------
>> >>  1 files changed, 32 insertions(+), 7 deletions(-)
>> >
>> > This patchset gives me the willies for the amount of changes and possible 
>> > impact on many subsystems.
>> 
>> It both is and is not that bad.  It is the cost of adding a lock.
>
> We both know that it is not only the cost of a lock, but also the 
> sprinkling over a pretty vast amount of subsystems, of another layer of 
> code.

I am not clear what problem you have.

Is it the sprinkling the code that takes and removes the lock?  Just
the VFS needs to be involved with that.  It is a slightly larger
surface area than doing the work inside the file operations as we
sometimes call the same method from 3-4 different places but it is
definitely a bounded problem.

Is it putting in the handful lines per subsystem to actually use this
functionality?  At that level something generic that is maintained
outside of the subsystem is better than the mess we have with 4-5
different implementations in the subsystems that need it, each having
a different assortment of bugs.

>> I thought of doing something more uniform to user space.  But I observed
>> that the existing epoll punts on the case of a file descriptor being closed
>> and locking to go from a file to the other epoll datastructures is pretty
>> horrid I said forget it and used the existing close behaviour.
>
> Well, you cannot rely on the caller to tidy up the epoll fd by issuing an 
> epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave 
> lingering crap around. You cannot even hold a reference of the file, since 
> otherwise the epoll hooking will have to trigger not only at ->release() 
> time, but at every close, where you'll have to figure out if this is the 
> last real userspace reference or not. Plus all the issues related to 
> holding permanent extra references to userspace files.
> And since a file can be added in many epoll devices, you need to 
> unregister it from all of them (hence the other datastructures lookup). 
> Better this, on the slow path, with locks acquired only in the epoll usage 
> case, than some other thing and on the fast path, for every file.

Sure, and that is largely and I am preserving those semantics.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 3, 2009, 2:57 p.m. UTC | #5
On Tue, 2 Jun 2009, Eric W. Biederman wrote:

> I am not clear what problem you have.
> 
> Is it the sprinkling the code that takes and removes the lock?  Just
> the VFS needs to be involved with that.  It is a slightly larger
> surface area than doing the work inside the file operations as we
> sometimes call the same method from 3-4 different places but it is
> definitely a bounded problem.
> 
> Is it putting in the handful lines per subsystem to actually use this
> functionality?  At that level something generic that is maintained
> outside of the subsystem is better than the mess we have with 4-5
> different implementations in the subsystems that need it, each having
> a different assortment of bugs.

Come on, only in the open fast path, there are at least two spin 
lock/unlock and two atomic ops. Without even starting to count all the 
extra branches and software added.
Is this stuff *really* needed, or we can faitly happily live w/out?


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 3, 2009, 8:53 p.m. UTC | #6
Davide Libenzi <davidel@xmailserver.org> writes:

> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> I am not clear what problem you have.
>> 
>> Is it the sprinkling the code that takes and removes the lock?  Just
>> the VFS needs to be involved with that.  It is a slightly larger
>> surface area than doing the work inside the file operations as we
>> sometimes call the same method from 3-4 different places but it is
>> definitely a bounded problem.
>> 
>> Is it putting in the handful lines per subsystem to actually use this
>> functionality?  At that level something generic that is maintained
>> outside of the subsystem is better than the mess we have with 4-5
>> different implementations in the subsystems that need it, each having
>> a different assortment of bugs.
>
> Come on, only in the open fast path, there are at least two spin 
> lock/unlock and two atomic ops. Without even starting to count all the 
> extra branches and software added.
> Is this stuff *really* needed, or we can faitly happily live w/out?

????

What code are you talking about?

To the open path a few memory writes and a smp_wmb.  No atomics and no
spin lock/unlocks.

Are you complaining because I retain the file_list?

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 4, 2009, 12:50 a.m. UTC | #7
On Wed, 3 Jun 2009, Eric W. Biederman wrote:

> What code are you talking about?
> 
> To the open path a few memory writes and a smp_wmb.  No atomics and no
> spin lock/unlocks.
> 
> Are you complaining because I retain the file_list?

Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin 
lock/unlock couple present in __dentry_open() (same sort of the release 
path)?
And that's only like 5% of the code touched by the new special handling of 
the file operations structure (basically, every f_op access ends up being 
wrapped by two atomic ops and other extra code).
The question, that I'd like to reiterate is, is this stuff really needed?
Anyway, my complaint ends here and I'll let others evaluate if merging 
this patchset is worth the cost.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 4, 2009, 1:42 a.m. UTC | #8
Davide Libenzi <davidel@xmailserver.org> writes:

> On Wed, 3 Jun 2009, Eric W. Biederman wrote:
>
>> What code are you talking about?
>> 
>> To the open path a few memory writes and a smp_wmb.  No atomics and no
>> spin lock/unlocks.
>> 
>> Are you complaining because I retain the file_list?
>
> Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin 
> lock/unlock couple present in __dentry_open() (same sort of the release 
> path)?

You might be remembering v1.  In v2 I have operations like file_hotplug_read_trylock
that implement a lock but use an rcu like algorithm.  So there are no atomic
operations involved with their associated pipeline stalls.  Over my previous
version this made a reasonable performance benefit.

> And that's only like 5% of the code touched by the new special handling of 
> the file operations structure (basically, every f_op access ends up being 
> wrapped by two atomic ops and other extra code).

Yes there is a single extra wrapping of every file in the syscall path.  So
we know that someone is using it. 

> The question, that I'd like to reiterate is, is this stuff really needed?
> Anyway, my complaint ends here and I'll let others evaluate if merging 
> this patchset is worth the cost.

Sure.  My apologies for not answering that question earlier.

My perspective is that every subsystem that winds up supporting hotplug
hardware winds up rolling it's own version of something like this,
and they each have a different set of bugs.

So one generic version is definitely worth implementing.

Similarly there is a case for a generic revoke facility in the kernel.
Alan at least has made the case that there are certain security problems
that can not be solved in userspace without revoke.

From an implementation point of view doing the generic implementation at
the vfs level has significant benefits.

The extra locking appears reasonable from a code maintenance and
comprehensibility point of view.  A real pain to find all of the entry
points into the vfs, and get other code to use the right vfs helpers
they should always have been using but I am volunteering to do that
work.

The practical question I see is are the performance overheads of my
primitives low enough that I do not cause performance regressions
on anyone's fast path.  As far as I have been able to measure is that 
the performance overhead is low enough, because I have been able to
avoid the use of atomics and have been able to use fairly small code
with predictable branches.  Which is why I pressed you to be certain
I understood where you are coming from.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a89f370..eabb167 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -627,8 +627,13 @@  static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
 	struct epitem *epi, *tmp;
 
 	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
-		    epi->event.events)
+		int events = DEAD_POLLMASK;
+		
+		if (file_hotplug_read_trylock(epi->ffd.file)) {
+			events = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+			file_hotplug_read_unlock(epi->ffd.file);
+		}
+		if (events & epi->event.events)
 			return POLLIN | POLLRDNORM;
 		else {
 			/*
@@ -1060,8 +1065,12 @@  static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 
 		list_del_init(&epi->rdllink);
 
-		revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
-			epi->event.events;
+		revents = DEAD_POLLMASK;
+		if (file_hotplug_read_trylock(epi->ffd.file)) {
+			revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+			file_hotplug_read_unlock(epi->ffd.file);
+		}
+		revents &= epi->event.events;
 
 		/*
 		 * If the event mask intersect the caller-requested one,
@@ -1248,10 +1257,17 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (!tfile)
 		goto error_fput;
 
+	error = -EIO;
+	if (!file_hotplug_read_trylock(file))
+		goto error_tgt_fput;
+
+	if (!file_hotplug_read_trylock(tfile))
+		goto error_file_unlock;
+
 	/* The target file descriptor must support poll */
 	error = -EPERM;
 	if (!tfile->f_op || !tfile->f_op->poll)
-		goto error_tgt_fput;
+		goto error_tgt_unlock;
 
 	/*
 	 * We have to check that the file structure underneath the file descriptor
@@ -1260,7 +1276,7 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	error = -EINVAL;
 	if (file == tfile || !is_file_epoll(file))
-		goto error_tgt_fput;
+		goto error_tgt_unlock;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1302,6 +1318,10 @@  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	}
 	mutex_unlock(&ep->mtx);
 
+error_tgt_unlock:
+	file_hotplug_read_unlock(tfile);
+error_file_unlock:
+	file_hotplug_read_unlock(file);
 error_tgt_fput:
 	fput(tfile);
 error_fput:
@@ -1338,13 +1358,16 @@  SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 	if (!file)
 		goto error_return;
 
+	error = -EIO;
+	if (!file_hotplug_read_trylock(file))
+		goto error_fput;
 	/*
 	 * We have to check that the file structure underneath the fd
 	 * the user passed to us _is_ an eventpoll file.
 	 */
 	error = -EINVAL;
 	if (!is_file_epoll(file))
-		goto error_fput;
+		goto error_unlock;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1355,6 +1378,8 @@  SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
 
+error_unlock:
+	file_hotplug_read_unlock(file);
 error_fput:
 	fput(file);
 error_return: