[2/2] aio: fix use-after-free due to missing POLLFREE handling

Message ID 20211204002301.116139-3-ebiggers@kernel.org (mailing list archive)
State New, archived
Series aio poll: fix use-after-free and missing wakeups

Commit Message

Eric Biggers Dec. 4, 2021, 12:23 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
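
For example, signalfd does roughly the following when the sighand_struct that
owns the waitqueue is torn down (a simplified sketch of signalfd_cleanup()):

        void signalfd_cleanup(struct sighand_struct *sighand)
        {
                wait_queue_head_t *wqh = &sighand->signalfd_wqh;

                /* Nothing is queued, so nobody needs to be notified. */
                if (likely(!waitqueue_active(wqh)))
                        return;

                /* Each waiter's wakeup callback is expected to see POLLFREE
                 * and remove its own entry before the queue is freed. */
                wake_up_poll(wqh, EPOLLHUP | POLLFREE);
        }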

Unfortunately, only eventpoll handles POLLFREE.  A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE.  This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.

Fix this by making aio poll handle POLLFREE.

A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen.  However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order.  Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.
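
(For reference, the lock nesting established in fs/aio.c, and kept by the patch
below, is roughly:

        spin_lock_irq(&ctx->ctx_lock);          /* aio context lock: outer */
        spin_lock(&req->head->lock);            /* waitqueue lock: inner */
        ...
        spin_unlock(&req->head->lock);
        spin_unlock_irq(&ctx->ctx_lock);

Since a wakeup callback already runs with the waitqueue lock held, completing
the request inline would have to take ctx->ctx_lock in the opposite order.)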

The second problem was solved by my previous patch.  This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way.  It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.

Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/aio.c                        | 102 ++++++++++++++++++++++++--------
 include/uapi/asm-generic/poll.h |   2 +-
 2 files changed, 79 insertions(+), 25 deletions(-)

Comments

Linus Torvalds Dec. 6, 2021, 7:28 p.m. UTC | #1
On Fri, Dec 3, 2021 at 4:23 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> require another solution.  This solution is for the queue to be cleared
> before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

Ugh.

I hate POLLFREE, and the more I look at this, the more I think it's broken.

And that

        wake_up_poll(wq, EPOLLHUP | POLLFREE);

in particular looks broken - the intent is that it should remove all
the wait queue entries (because the wait queue head is going away),
but wake_up_poll() itself actually does

        __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))

where that '1' is the number of exclusive entries it will wake up.

So if there are two exclusive waiters, wake_up_poll() will simply stop
waking things up after the first one.

Which defeats the whole POLLFREE thing too.
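
For reference, the core wakeup loop honors that count roughly like this (a
simplified paraphrase of __wake_up_common()):

        list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
                unsigned flags = curr->flags;
                int ret = curr->func(curr, mode, wake_flags, key);

                if (ret < 0)
                        break;
                /* A successfully woken exclusive waiter uses up the budget;
                 * with nr_exclusive == 1 the loop stops right here. */
                if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }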

Maybe I'm missing something, but POLLFREE really is broken.

I'd argue that all of epoll() is broken, but I guess we're stuck with it.

Now, it's very possible that nobody actually uses exclusive waits for
those wait queues, and my "nr_exclusive" argument is about something
that isn't actually a bug in reality. But I think it's a sign of
confusion, and it's just another issue with POLLFREE.

I really wish we could have some way to not have epoll and aio mess
with the wait-queue lists and cache the wait queue head pointers that
they don't own.

In the meantime, I don't think these patches make things worse, and
they may fix things. But see above about "nr_exclusive" and how I
think wait queue entries might end up avoiding POLLFREE handling..

                  Linus
Eric Biggers Dec. 6, 2021, 7:54 p.m. UTC | #2
On Mon, Dec 06, 2021 at 11:28:13AM -0800, Linus Torvalds wrote:
> On Fri, Dec 3, 2021 at 4:23 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > require another solution.  This solution is for the queue to be cleared
> > before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
> 
> Ugh.
> 
> I hate POLLFREE, and the more I look at this, the more I think it's broken.
> 
> And that
> 
>         wake_up_poll(wq, EPOLLHUP | POLLFREE);
> 
> in particular looks broken - the intent is that it should remove all
> the wait queue entries (because the wait queue head is going away),
> but wake_up_poll() itself actually does
> 
>         __wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
> 
> where that '1' is the number of exclusive entries it will wake up.
> 
> So if there are two exclusive waiters, wake_up_poll() will simply stop
> waking things up after the first one.
> 
> Which defeats the whole POLLFREE thing too.
> 
> Maybe I'm missing something, but POLLFREE really is broken.
> 
> I'd argue that all of epoll() is broken, but I guess we're stuck with it.
> 
> Now, it's very possible that nobody actually uses exclusive waits for
> those wait queues, and my "nr_exclusive" argument is about something
> that isn't actually a bug in reality. But I think it's a sign of
> confusion, and it's just another issue with POLLFREE.
> 
> I really wish we could have some way to not have epoll and aio mess
> with the wait-queue lists and cache the wait queue head pointers that
> they don't own.
> 
> In the meantime, I don't think these patches make things worse, and
> they may fix things. But see above about "nr_exclusive" and how I
> think wait queue entries might end up avoiding POLLFREE handling..
> 
>                   Linus

epoll supports exclusive waits, via the EPOLLEXCLUSIVE flag.  So this looks like
a real problem.

It could be fixed by converting signalfd and binder to use something like this,
right?

	#define wake_up_pollfree(x)  \
	       __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
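
signalfd's cleanup path, for instance, could then become something like:

        void signalfd_cleanup(struct sighand_struct *sighand)
        {
                /* Every entry must be gone by the time this returns, since
                 * sighand (and thus the waitqueue head) is about to be freed. */
                wake_up_pollfree(&sighand->signalfd_wqh);
        }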

As for eliminating POLLFREE entirely, that would require that the waitqueue
heads be moved to a location which has a longer lifetime.  I'm not sure if
that's possible.  In the case of signalfd, maybe the waitqueue head could be
moved to the file private data (signalfd_ctx), and then sighand_struct would
contain a list of signalfd_ctx's which are receiving signals directed to that
sighand_struct, rather than the waitqueue head itself.  I'm not sure how well
that would work.  This would probably change user-visible behavior; if a
signalfd is inherited by fork(), the child process would be notified about
signals sent to the parent process, rather than about its own signals as is
currently the case.

- Eric
Linus Torvalds Dec. 6, 2021, 9:48 p.m. UTC | #3
On Mon, Dec 6, 2021 at 11:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> It could be fixed by converting signalfd and binder to use something like this,
> right?
>
>         #define wake_up_pollfree(x)  \
>                __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))

Yeah, that looks better to me.

That said, maybe it would be even better to then make it more explicit
about what it does, and not make it look like just another wakeup with
an odd parameter.

IOW, maybe that "pollfree()" function itself could very much do the
waitqueue entry removal on each entry using list_del_init(), and not
expect the wakeup code for the entry to do so.

I think that kind of explicit "this removes all entries from the wait
list head" is better than "let's do a wakeup and expect all entries to
magically implicitly remove themselves".

After all, that "implicitly remove themselves" was what didn't happen,
and caused the bug in the first place.

And all the normal cases, that don't care about POLLFREE at all,
because their waitqueues don't go away from under them, wouldn't care,
because "list_del_init()" still  leaves a valid self-pointing list in
place, so if they do list_del() afterwards, nothing happens.
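
(For reference, list_del_init() is essentially:

        static inline void list_del_init(struct list_head *entry)
        {
                __list_del_entry(entry);
                /* leaves entry->next == entry->prev == entry */
                INIT_LIST_HEAD(entry);
        }

so a later removal of an already-detached entry is harmless.)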

I dunno. But yes, that wake_up_pollfree() of yours certainly looks
better than what we have now.

> As for eliminating POLLFREE entirely, that would require that the waitqueue
> heads be moved to a location which has a longer lifetime.

Yeah, the problem with aio and epoll is exactly that they end up using
waitqueue heads without knowing what they are.

I'm not at all convinced that there aren't other situations where the
thing the waitqueue head is embedded in might have some other lifetime.

The *common* situation is obviously that it's associated with a file,
and the file pointer ends up holding the reference to whatever device
or something (global list in a loadable module, or whatever) it is.

Hmm. The poll_wait() callback function actually does get the 'struct
file *' that the wait is associated with. I wonder if epoll queueing
could actually increment the file ref when it creates its own wait
entry, and release it at ep_remove_wait_queue()?

Maybe epoll could avoid having to remove entries entirely that way -
simply by virtue of having a ref to the files - and remove the need
for having the ->whead pointer entirely (and remove the need for
POLLFREE handling)?

And maybe the signalfd case can do the same - instead of expecting
exit() to clean up the list when sighand->count goes to zero, maybe
the signalfd filp can just hold a ref to that 'struct sighand_struct',
and it gets free'd whenever there are no signalfd users left?

That would involve making signalfd_ctx actually tied to one particular
'struct signal', but that might be the right thing to do regardless
(instead of making it always act on 'current' like it does now).

So maybe with some re-organization, we could get rid of the need for
POLLFREE entirely.. Anybody?

But your patches are certainly simpler in that they just fix the status quo.

             Linus
Eric Biggers Dec. 7, 2021, 10:21 a.m. UTC | #4
On Mon, Dec 06, 2021 at 01:48:04PM -0800, Linus Torvalds wrote:
> On Mon, Dec 6, 2021 at 11:54 AM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > It could be fixed by converting signalfd and binder to use something like this,
> > right?
> >
> >         #define wake_up_pollfree(x)  \
> >                __wake_up(x, TASK_NORMAL, 0, poll_to_key(EPOLLHUP | POLLFREE))
> 
> Yeah, that looks better to me.
> 
> That said, maybe it would be even better to then make it more explicit
> about what it does, and not make it look like just another wakeup with
> an odd parameter.
> 
> IOW, maybe that "pollfree()" function itself could very much do the
> waitqueue entry removal on each entry using list_del_init(), and not
> expect the wakeup code for the entry to do so.
> 
> I think that kind of explicit "this removes all entries from the wait
> list head" is better than "let's do a wakeup and expect all entries to
> magically implicitly remove themselves".
> 
> After all, that "implicitly remove themselves" was what didn't happen,
> and caused the bug in the first place.
> 
> And all the normal cases, that don't care about POLLFREE at all,
> because their waitqueues don't go away from under them, wouldn't care,
> because "list_del_init()" still  leaves a valid self-pointing list in
> place, so if they do list_del() afterwards, nothing happens.
> 
> I dunno. But yes, that wake_up_pollfree() of yours certainly looks
> better than what we have now.

In v2 (https://lore.kernel.org/r/20211207095726.169766-1-ebiggers@kernel.org/),
I did end up making wake_up_pollfree() into a function, which calls __wake_up()
and also checks that the queue became empty.
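
Roughly, it boils down to this (paraphrased sketch of the v2 helper):

        void __wake_up_pollfree(struct wait_queue_head *wq_head)
        {
                __wake_up(wq_head, TASK_NORMAL, 0,
                          poll_to_key(EPOLLHUP | POLLFREE));
                /* POLLFREE handlers must have removed every entry. */
                WARN_ON_ONCE(waitqueue_active(wq_head));
        }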

However, I didn't make wake_up_pollfree() remove the queue entries itself.  I
don't think that would be better, given that the purpose of POLLFREE is to
provide a notification that the queue is going away, not simply to remove the
queue entries.  In particular, we might want to cancel the poll request(s);
that's what aio poll does.  Also, we need to synchronize with other places where
the waitqueue is accessed; removing an entry, for example, requires taking the
waitqueue lock (see either aio poll or epoll).

> > As for eliminating POLLFREE entirely, that would require that the waitqueue
> > heads be moved to a location which has a longer lifetime.
> 
> Yeah, the problem with aio and epoll is exactly that they end up using
> waitqueue heads without knowing what they are.
> 
> I'm not at all convinced that there aren't other situations where the
> thing the waitqueue head is embedded in might have some other lifetime.
> 
> The *common* situation is obviously that it's associated with a file,
> and the file pointer ends up holding the reference to whatever device
> or something (global list in a loadable module, or whatever) it is.
> 
> Hmm. The poll_wait() callback function actually does get the 'struct
> file *' that the wait is associated with. I wonder if epoll queueing
> could actually increment the file ref when it creates its own wait
> entry, and release it at ep_remove_wait_queue()?
> 
> Maybe epoll could avoid having to remove entries entirely that way -
> simply by virtue of having a ref to the files - and remove the need
> for having the ->whead pointer entirely (and remove the need for
> POLLFREE handling)?

Well, the documented user-visible behavior of epoll is that when a file is
closed, it is automatically removed from all epoll instances.  That precludes
epoll instances from holding references to the files which they are polling.

And the reason that POLLFREE needs to exist is that holding a file reference
isn't enough anyway, since in the case of signalfd and binder the waitqueue may
have a shorter lifetime.  If the file reference was enough, then aio poll
wouldn't need to handle POLLFREE.

> 
> And maybe the signalfd case can do the same - instead of expecting
> exit() to clean up the list when sighand->count goes to zero, maybe
> the signalfd filp can just hold a ref to that 'struct sighand_struct',
> and it gets free'd whenever there are no signalfd users left?
> 
> That would involve making signalfd_ctx actually tied to one particular
> 'struct signal', but that might be the right thing to do regardless
> (instead of making it always act on 'current' like it does now).
> 
> So maybe with some re-organization, we could get rid of the need for
> POLLFREE entirely.. Anybody?
> 
> But your patches are certainly simpler in that they just fix the status quo.

That would be really nice, but unfortunately the current behavior is documented
in signalfd(2), for example:

"""
   fork(2) semantics
       After  a  fork(2),  the  child inherits a copy of the signalfd file de‐
       scriptor.  A read(2) from the file descriptor in the child will  return
       information about signals queued to the child.
"""

With your proposed change to signalfd, the child would receive its parent's
signals via the signalfd, rather than its own signals.

That may be how signalfd should have worked originally.  I don't know whether it
is actually safe to change or not, but the fact that the current behavior is
specifically documented in the man page isn't too promising.

Also, changing signalfd by itself wouldn't allow removal of POLLFREE; binder
would have to be changed too.  I'm not a binder expert, but fixing binder
doesn't look super promising.  It looks pretty intentional that binder_poll()
operates on per-thread state, while the binder file description itself is a
per-process thing.

- Eric

Patch

diff --git a/fs/aio.c b/fs/aio.c
index b72beeaed0631..8daafb95b6801 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1620,6 +1620,50 @@  static void aio_poll_put_work(struct work_struct *work)
 	iocb_put(iocb);
 }
 
+/*
+ * Safely lock the waitqueue which the request is on, taking into account the
+ * fact that the waitqueue may be freed at any time before the lock is taken.
+ *
+ * Returns true on success, meaning that req->head->lock was locked and
+ * req->wait is on req->head.  Returns false if the request has already been
+ * removed from its waitqueue (which might no longer exist).
+ */
+static bool poll_iocb_lock_wq(struct poll_iocb *req)
+{
+	/*
+	 * Holding req->head->lock prevents req->wait being removed from the
+	 * waitqueue, and thus prevents the waitqueue from being freed
+	 * (POLLFREE) in the rare cases of ->poll() implementations where the
+	 * waitqueue may not live as long as the struct file.  However, taking
+	 * req->head->lock in the first place can race with POLLFREE.
+	 *
+	 * We solve this in a similar way as eventpoll does: by taking advantage
+	 * of the fact that all users of POLLFREE will RCU-delay the actual
+	 * free.  If we enter rcu_read_lock() and check that req->wait isn't
+	 * already removed, then req->head is guaranteed to be accessible until
+	 * rcu_read_unlock().  We can then lock req->head->lock and re-check
+	 * whether req->wait has been removed or not.
+	 */
+	rcu_read_lock();
+	if (list_empty_careful(&req->wait.entry)) {
+		rcu_read_unlock();
+		return false;
+	}
+	spin_lock(&req->head->lock);
+	if (list_empty(&req->wait.entry)) {
+		spin_unlock(&req->head->lock);
+		rcu_read_unlock();
+		return false;
+	}
+	rcu_read_unlock();
+	return true;
+}
+
+static void poll_iocb_unlock_wq(struct poll_iocb *req)
+{
+	spin_unlock(&req->head->lock);
+}
+
 static void aio_poll_complete_work(struct work_struct *work)
 {
 	struct poll_iocb *req = container_of(work, struct poll_iocb, work);
@@ -1639,21 +1683,22 @@  static void aio_poll_complete_work(struct work_struct *work)
 	 * avoid further branches in the fast path.
 	 */
 	spin_lock_irq(&ctx->ctx_lock);
-	spin_lock(&req->head->lock);
-	if (!mask && !READ_ONCE(req->cancelled)) {
-		/* Reschedule completion if another wakeup came in. */
-		if (req->work_need_resched) {
-			schedule_work(&req->work);
-			req->work_need_resched = false;
-		} else {
-			req->work_scheduled = false;
+	if (poll_iocb_lock_wq(req)) {
+		if (!mask && !READ_ONCE(req->cancelled)) {
+			/* Reschedule completion if another wakeup came in. */
+			if (req->work_need_resched) {
+				schedule_work(&req->work);
+				req->work_need_resched = false;
+			} else {
+				req->work_scheduled = false;
+			}
+			poll_iocb_unlock_wq(req);
+			spin_unlock_irq(&ctx->ctx_lock);
+			return;
 		}
-		spin_unlock(&req->head->lock);
-		spin_unlock_irq(&ctx->ctx_lock);
-		return;
-	}
-	list_del_init(&req->wait.entry);
-	spin_unlock(&req->head->lock);
+		list_del_init(&req->wait.entry);
+		poll_iocb_unlock_wq(req);
+	} /* else, POLLFREE has freed the waitqueue, so we must complete */
 	list_del_init(&iocb->ki_list);
 	iocb->ki_res.res = mangle_poll(mask);
 	spin_unlock_irq(&ctx->ctx_lock);
@@ -1667,13 +1712,12 @@  static int aio_poll_cancel(struct kiocb *iocb)
 	struct aio_kiocb *aiocb = container_of(iocb, struct aio_kiocb, rw);
 	struct poll_iocb *req = &aiocb->poll;
 
-	spin_lock(&req->head->lock);
-	WRITE_ONCE(req->cancelled, true);
-	if (!list_empty(&req->wait.entry)) {
+	if (poll_iocb_lock_wq(req)) {
+		WRITE_ONCE(req->cancelled, true);
 		list_del_init(&req->wait.entry);
-		schedule_work(&aiocb->poll.work);
-	}
-	spin_unlock(&req->head->lock);
+		schedule_work(&req->work);
+		poll_iocb_unlock_wq(req);
+	} /* else, the request was force-cancelled by POLLFREE already */
 
 	return 0;
 }
@@ -1717,6 +1761,14 @@  static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		if (iocb)
 			iocb_put(iocb);
 	} else {
+		if (mask & POLLFREE) {
+			/*
+			 * The waitqueue is going to be freed, so remove the
+			 * request from the waitqueue and mark it as cancelled.
+			 */
+			list_del_init(&req->wait.entry);
+			WRITE_ONCE(req->cancelled, true);
+		}
 		/*
 		 * Schedule the completion work if needed.  If it was already
 		 * scheduled, record that another wakeup came in.
@@ -1789,8 +1841,9 @@  static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 	mask = vfs_poll(req->file, &apt.pt) & req->events;
 	spin_lock_irq(&ctx->ctx_lock);
 	if (likely(req->head)) {
-		spin_lock(&req->head->lock);
-		if (req->work_scheduled) {
+		bool on_queue = poll_iocb_lock_wq(req);
+
+		if (!on_queue || req->work_scheduled) {
 			/* async completion work was already scheduled */
 			if (apt.error)
 				cancel = true;
@@ -1803,12 +1856,13 @@  static int aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 		} else if (cancel) {
 			/* cancel if possible (may be too late though) */
 			WRITE_ONCE(req->cancelled, true);
-		} else if (!list_empty(&req->wait.entry)) {
+		} else if (on_queue) {
 			/* actually waiting for an event */
 			list_add_tail(&aiocb->ki_list, &ctx->active_reqs);
 			aiocb->ki_cancel = aio_poll_cancel;
 		}
-		spin_unlock(&req->head->lock);
+		if (on_queue)
+			poll_iocb_unlock_wq(req);
 	}
 	if (mask) { /* no async, we'd stolen it */
 		aiocb->ki_res.res = mangle_poll(mask);
diff --git a/include/uapi/asm-generic/poll.h b/include/uapi/asm-generic/poll.h
index 41b509f410bf9..f9c520ce4bf4e 100644
--- a/include/uapi/asm-generic/poll.h
+++ b/include/uapi/asm-generic/poll.h
@@ -29,7 +29,7 @@ 
 #define POLLRDHUP       0x2000
 #endif
 
-#define POLLFREE	(__force __poll_t)0x4000	/* currently only for epoll */
+#define POLLFREE	(__force __poll_t)0x4000
 
 #define POLL_BUSY_LOOP	(__force __poll_t)0x8000