[v3,2/3] eventfd: add internal reference counting to fix notifier race conditions

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback.  This lets eventfd clients know when the eventfd is about
to go away, which is particularly useful for in-kernel clients.  However,
as it stands today it is not possible to use this feature of eventfd in a
race-free way.  This patch adds some additional logic to eventfd in order
to rectify this problem.

Background:
-----------------------
Eventfd currently only has one reference count mechanism: fget/fput.  This
in and of itself is normally fine.  However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS.  Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference.  To make matters more complicated,
the release callback is issued in an unlocked state.  This makes it nearly
impossible to design a mutual decoupling algorithm: you cannot unhook one
side from the other (or vice versa) without racing.

-----------------------

In summary, there are two fundamental problems:

1) The POLLHUP wakeup is broadcast without holding the wait-queue lock
2) Clients have no way to hold a reference to the wait-queue-head
   (embedded in eventfd_ctx)

We fix this by using the locked variant of wakeup for POLLHUP, and by
adding/exposing a kref to the underlying eventfd_ctx.  Clients should then
be able to govern their usage of the wait-queue as they do for any other
wait-queue in the kernel.

We propose this lower-level solution rather than trying to encapsulate the
poll-callback because there are advantages to decoupling the
remove_wait_queue from the kref_put().  Namely, it is nice to unhook the
wait-queue inside the wakeup, but to defer the kref_put() until we can
synchronize with the client.
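
The ordering described above can be modelled in userspace roughly as
follows.  This is only a sketch under stated assumptions: every model_*
name is hypothetical and is not the interface added by this patch, a plain
int stands in for the kref, and a bool stands in for the wait-queue
spinlock.  The model is single-threaded and illustrates ordering only, not
real locking.

```c
/* Userspace model only: the model_* names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_ctx;

/* One registered waiter; func is invoked with the ctx "lock" held. */
struct model_waiter {
    void (*func)(struct model_ctx *ctx, struct model_waiter *w, bool hup);
    struct model_waiter *next;
    bool hooked;       /* still linked on the wait-queue */
    bool put_pending;  /* saw HUP and still owes a model_ctx_put() */
};

struct model_ctx {
    int refs;                   /* stands in for a kref */
    bool locked;                /* stands in for the wait-queue spinlock */
    struct model_waiter *head;  /* stands in for the wait-queue head */
    bool *freed;                /* lets the caller observe the final put */
};

struct model_ctx *model_ctx_new(bool *freed)
{
    struct model_ctx *ctx = calloc(1, sizeof(*ctx));
    if (!ctx)
        return NULL;
    ctx->refs = 1;  /* the reference owned by the file */
    ctx->freed = freed;
    *freed = false;
    return ctx;
}

void model_ctx_get(struct model_ctx *ctx)
{
    ctx->refs++;
}

void model_ctx_put(struct model_ctx *ctx)
{
    if (--ctx->refs == 0) {
        *ctx->freed = true;
        free(ctx);
    }
}

/* Unlink w from the queue; only legal while the "lock" is held. */
void model_unhook_locked(struct model_ctx *ctx, struct model_waiter *w)
{
    struct model_waiter **pp = &ctx->head;

    assert(ctx->locked);
    while (*pp && *pp != w)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = w->next;
        w->next = NULL;
        w->hooked = false;
    }
}

/* Client wakeup: unhook inside the locked wakeup, defer the put. */
void model_client_wakeup(struct model_ctx *ctx, struct model_waiter *w, bool hup)
{
    if (!hup)
        return;
    model_unhook_locked(ctx, w);
    w->put_pending = true;
}

/* Client registration: pin the ctx first, then join the queue. */
void model_client_attach(struct model_ctx *ctx, struct model_waiter *w)
{
    model_ctx_get(ctx);
    ctx->locked = true;
    w->func = model_client_wakeup;
    w->hooked = true;
    w->put_pending = false;
    w->next = ctx->head;
    ctx->head = w;
    ctx->locked = false;
}

/* Release path: broadcast HUP with the "lock" held, then drop the file ref. */
void model_release(struct model_ctx *ctx)
{
    struct model_waiter *w;

    ctx->locked = true;
    w = ctx->head;
    while (w) {
        struct model_waiter *next = w->next;  /* w may unhook itself */
        w->func(ctx, w, true);
        w = next;
    }
    ctx->locked = false;
    model_ctx_put(ctx);
}

/* Later, once the client has synchronized with its own users. */
void model_client_finish(struct model_ctx *ctx, struct model_waiter *w)
{
    if (w->put_pending) {
        w->put_pending = false;
        model_ctx_put(ctx);
    }
}
```

In this model the ctx outlives model_release() because the client's
reference is still outstanding; it is freed only once the client calls
model_client_finish().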

With these two changes, we believe we now have a race-free release
mechanism.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Davide Libenzi <davidel@xmailserver.org>
---

 fs/eventfd.c            |   43 ++++++++++++++++++++++++++++++++++++-------
 include/linux/eventfd.h |    7 +++++++
 2 files changed, 43 insertions(+), 7 deletions(-)
