[1/2] sysctl: read() must consume poll events, not poll()

Message ID 20220502140602.130373-1-Jason@zx2c4.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu

Commit Message

Jason A. Donenfeld May 2, 2022, 2:06 p.m. UTC
Events that poll() responds to are supposed to be consumed when the file
is read(), not by the poll() itself. By putting it on the poll() itself,
it makes it impossible to poll() on an epoll file descriptor, since the
event gets consumed too early. Jann wrote a PoC, available in the link
below.

Reported-by: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/proc_sysctl.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
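
As a point of comparison for the contract described in the commit message, here is a minimal userspace sketch using an ordinary pipe rather than a sysctl file. The helper names is_readable() and consume_one() are made up for this illustration. On a pipe, readiness persists across repeated poll() calls and only goes away once read() consumes the data:

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: zero-timeout poll() asking "is fd readable now?". */
static bool is_readable(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	assert(fd >= 0);
	/* poll() returns the number of descriptors with non-zero revents. */
	return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN) != 0;
}

/* Hypothetical helper: read() one byte; on a pipe this is what
 * clears the readiness, poll() alone never does. */
static bool consume_one(int fd)
{
	char c;

	assert(fd >= 0);
	return read(fd, &c, 1) == 1;
}
```

With one byte written into a pipe, is_readable() on the read end keeps returning true no matter how often it is called, and returns false only after consume_one(). The sysctl poll implementation before this patch differs in that the first poll() already clears the pending state.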

Comments

Jason A. Donenfeld May 2, 2022, 3:30 p.m. UTC | #1
+Lennart, since systemd is the only userspace I know of currently making
use of this.

On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
> Events that poll() responds to are supposed to be consumed when the file
> is read(), not by the poll() itself. By putting it on the poll() itself,
> it makes it impossible to poll() on an epoll file descriptor, since the
> event gets consumed too early. Jann wrote a PoC, available in the link
> below.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  fs/proc/proc_sysctl.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 7d9cfc730bd4..1aa145794207 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>  
>  static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
>  {
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct ctl_table_header *head = grab_header(inode);
> +	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
> +
> +	if (!IS_ERR(head) && table->poll)
> +		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
> +	sysctl_head_finish(head);
> +
>  	return proc_sys_call_handler(iocb, iter, 0);
>  }
>  
> @@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
>  	event = (unsigned long)filp->private_data;
>  	poll_wait(filp, &table->poll->wait, wait);
>  
> -	if (event != atomic_read(&table->poll->event)) {
> -		filp->private_data = proc_sys_poll_event(table->poll);
> +	if (event != atomic_read(&table->poll->event))
>  		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
> -	}
>  
>  out:
>  	sysctl_head_finish(head);
> -- 
> 2.35.1

Just wanted to double check with you that this change wouldn't break how
you're using it in systemd for /proc/sys/kernel/hostname:

https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465

I couldn't find anybody else actually polling on it. Interestingly, it
looks like sd_event_add_io uses epoll() inside, but you're not hitting
the bug that Jann pointed out (because I suppose you're not poll()ing on
an epoll fd).

Jason
Lennart Poettering May 2, 2022, 3:43 p.m. UTC | #2
On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> Just wanted to double check with you that this change wouldn't break how
> you're using it in systemd for /proc/sys/kernel/hostname:
>
> https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>
> I couldn't find anybody else actually polling on it. Interestingly, it
> looks like sd_event_add_io uses epoll() inside, but you're not hitting
> the bug that Jann pointed out (because I suppose you're not poll()ing on
> an epoll fd).

Well, if you made sure this still works, I am fine either way ;-)

Lennart

--
Lennart Poettering, Berlin
Jason A. Donenfeld May 3, 2022, 11:27 a.m. UTC | #3
On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
> On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> 
> > Just wanted to double check with you that this change wouldn't break how
> > you're using it in systemd for /proc/sys/kernel/hostname:
> >
> > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
> >
> > I couldn't find anybody else actually polling on it. Interestingly, it
> > looks like sd_event_add_io uses epoll() inside, but you're not hitting
> > the bug that Jann pointed out (because I suppose you're not poll()ing on
> > an epoll fd).
> 
> Well, if you made sure this still works, I am fine either way ;-)

Actually... ugh. It doesn't work. systemd uses uname() to read the host
name, and doesn't actually read() the file descriptor after receiving
the poll event on it. So I guess I'll forget this, and maybe we'll have
to live with sysctl's poll() being broken. :(

Jason
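
The difference between the two behaviours can be sketched with a small userspace model. This is not kernel code: every name below (model_sysctl, model_file, poll_consuming, poll_reporting, model_read) is hypothetical and only mirrors the roles of table->poll->event and filp->private_data in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types, not kernel structures. */
struct model_sysctl { unsigned long event; };	/* role of table->poll->event */
struct model_file   { unsigned long seen; };	/* role of filp->private_data */

/* open() records the current event counter in the file. */
static void model_open(struct model_file *f, const struct model_sysctl *s)
{
	f->seen = s->event;
}

/* A write to the sysctl bumps the counter. */
static void model_change(struct model_sysctl *s)
{
	s->event++;
}

/* Current behaviour: poll() reports the event and consumes it. */
static bool poll_consuming(struct model_file *f, const struct model_sysctl *s)
{
	assert(f && s);
	if (f->seen == s->event)
		return false;
	f->seen = s->event;
	return true;
}

/* Patched behaviour: poll() only reports the event... */
static bool poll_reporting(const struct model_file *f,
			   const struct model_sysctl *s)
{
	assert(f && s);
	return f->seen != s->event;
}

/* ...and read() is what consumes it. */
static void model_read(struct model_file *f, const struct model_sysctl *s)
{
	f->seen = s->event;
}
```

In this model, after one model_change(), poll_consuming() returns true once and then false, which is the "consumed too early" case from the commit message. poll_reporting() keeps returning true until model_read() is called, so a consumer that reacts to the event with uname() and never read()s the fd would see it stay ready indefinitely.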
Luis Chamberlain May 12, 2022, 5:40 p.m. UTC | #4
On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> > 
> > > Just wanted to double check with you that this change wouldn't break how
> > > you're using it in systemd for /proc/sys/kernel/hostname:
> > >
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
> > >
> > > I couldn't find anybody else actually polling on it. Interestingly, it
> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
> > > an epoll fd).
> > 
> > Well, if you made sure this still works, I am fine either way ;-)
> 
> Actually... ugh. It doesn't work. systemd uses uname() to read the host
> name, and doesn't actually read() the file descriptor after receiving
> the poll event on it. So I guess I'll forget this, and maybe we'll have
> to live with sysctl's poll() being broken. :(

A kconfig option may let you do what you want, and allow older kernels
to not break, however I am more curious how sysctl's approach to poll
went unnoticed for so long. But also, I'm curious if it was based on
another poll implementation which may have been busted.

But more importantly, how do we avoid this in the future?

  Luis
Lucas De Marchi May 12, 2022, 6:22 p.m. UTC | #5
On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>Events that poll() responds to are supposed to be consumed when the file
>is read(), not by the poll() itself. By putting it on the poll() itself,
>it makes it impossible to poll() on an epoll file descriptor, since the
>event gets consumed too early. Jann wrote a PoC, available in the link
>below.
>
>Reported-by: Jann Horn <jannh@google.com>
>Cc: Kees Cook <keescook@chromium.org>
>Cc: Luis Chamberlain <mcgrof@kernel.org>
>Cc: linux-fsdevel@vger.kernel.org
>Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

It seems to be my bug. This is indeed better. Also, I don't think it's unsafe
to fix it like this either. If my memory serves (it's what, 10+ years?), this
was only tested and used with poll(), which will continue to work.

There were plans to use it in one of systemd's tools, in which case we'd
probably notice the misbehavior with epoll()... hmm, checking systemd's
codebase now:

static int on_hostname_change(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
	...
	log_info("System hostname changed to '%s'.", full_hostname);
	...
}

static int manager_watch_hostname(Manager *m) {
	int r;

	assert(m);

	m->hostname_fd = open("/proc/sys/kernel/hostname",
			      O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY);
	if (m->hostname_fd < 0) {
		log_warning_errno(errno, "Failed to watch hostname: %m");
		return 0;
	}

	r = sd_event_add_io(m->event, &m->hostname_event_source, m->hostname_fd, 0, on_hostname_change, m);
	if (r < 0) {
		if (r == -EPERM)
			/* kernels prior to 3.2 don't support polling this file. Ignore the failure. */
			m->hostname_fd = safe_close(m->hostname_fd);
		else
			return log_error_errno(r, "Failed to add hostname event source: %m");
	}
	....
}

and sd_event library uses epoll. So, it's apparently not working and it doesn't
seem to be their intention to rely on the misbehavior. This makes me think it
even deserves a Cc to stable.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>


Lucas De Marchi

>---
> fs/proc/proc_sysctl.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index 7d9cfc730bd4..1aa145794207 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>
> static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
> {
>+	struct inode *inode = file_inode(iocb->ki_filp);
>+	struct ctl_table_header *head = grab_header(inode);
>+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
>+
>+	if (!IS_ERR(head) && table->poll)
>+		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
>+	sysctl_head_finish(head);
>+
> 	return proc_sys_call_handler(iocb, iter, 0);
> }
>
>@@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
> 	event = (unsigned long)filp->private_data;
> 	poll_wait(filp, &table->poll->wait, wait);
>
>-	if (event != atomic_read(&table->poll->event)) {
>-		filp->private_data = proc_sys_poll_event(table->poll);
>+	if (event != atomic_read(&table->poll->event))
> 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
>-	}
>
> out:
> 	sysctl_head_finish(head);
>-- 
>2.35.1
>
Jason A. Donenfeld May 12, 2022, 6:27 p.m. UTC | #6
Hi Lucas,

On 5/12/22, Lucas De Marchi <lucas.demarchi@intel.com> wrote:
> On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>>Events that poll() responds to are supposed to be consumed when the file
>>is read(), not by the poll() itself. By putting it on the poll() itself,
>>it makes it impossible to poll() on an epoll file descriptor, since the
>>event gets consumed too early. Jann wrote a PoC, available in the link
>>below.
>>
>>Reported-by: Jann Horn <jannh@google.com>
>>Cc: Kees Cook <keescook@chromium.org>
>>Cc: Luis Chamberlain <mcgrof@kernel.org>
>>Cc: linux-fsdevel@vger.kernel.org
>>Link:
>> https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>
> It seems to be my bug. This is indeed better. Also, I don't think it's
> unsafe
> to fix it like this either. If my memory serves (it's what, 10+ years?),
> this
> was only tested and used with poll(), which will continue to work.

You are not correct. Please read the entire thread. This breaks systemd.

Jason
Eric W. Biederman May 12, 2022, 6:29 p.m. UTC | #7
Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>> > 
>> > > Just wanted to double check with you that this change wouldn't break how
>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>> > >
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>> > >
>> > > I couldn't find anybody else actually polling on it. Interestingly, it
>> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
>> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
>> > > an epoll fd).
>> > 
>> > Well, if you made sure this still works, I am fine either way ;-)
>> 
>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>> name, and doesn't actually read() the file descriptor after receiving
>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>> to live with sysctl's poll() being broken. :(

We should be able to modify calling uname() to act the same as reading
the file descriptor. 

> A kconfig option may let you do what you want, and allow older kernels
> to not break, however I am more curious how sysctl's approach to poll
> went unnoticed for so long. But also, I'm curious if it was based on
> another poll implementation which may have been busted.
>
> But more importantly, how do we avoid this in the future?

Poll on files is weird and generally doesn't work (because files are
always ready to read or write).  What did we do to make it work on these
sysctl files?

Eric
Jason A. Donenfeld May 12, 2022, 6:32 p.m. UTC | #8
Hi Eric,

On 5/12/22, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
>
>> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>>> >
>>> > > Just wanted to double check with you that this change wouldn't break
>>> > > how
>>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>>> > >
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>>> > >
>>> > > I couldn't find anybody else actually polling on it. Interestingly,
>>> > > it
>>> > > looks like sd_event_add_io uses epoll() inside, but you're not
>>> > > hitting
>>> > > the bug that Jann pointed out (because I suppose you're not poll()ing
>>> > > on
>>> > > an epoll fd).
>>> >
>>> > Well, if you made sure this still works, I am fine either way ;-)
>>>
>>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>>> name, and doesn't actually read() the file descriptor after receiving
>>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>>> to live with sysctl's poll() being broken. :(
>
> We should be able to modify calling uname() to act the same as reading
> the file descriptor.

How? That sounds like madness. read() takes an fd. uname() doesn't. Are
you proposing we walk through the fds of the process calling uname()
until we find a matching one and then twiddle its private context
state? I mean I guess that'd work, but...

Jason
Patch

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..1aa145794207 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -622,6 +622,14 @@  static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 
 static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
 {
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct ctl_table_header *head = grab_header(inode);
+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+
+	if (!IS_ERR(head) && table->poll)
+		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
+	sysctl_head_finish(head);
+
 	return proc_sys_call_handler(iocb, iter, 0);
 }
 
@@ -668,10 +676,8 @@  static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
 	event = (unsigned long)filp->private_data;
 	poll_wait(filp, &table->poll->wait, wait);
 
-	if (event != atomic_read(&table->poll->event)) {
-		filp->private_data = proc_sys_poll_event(table->poll);
+	if (event != atomic_read(&table->poll->event))
 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
-	}
 
 out:
 	sysctl_head_finish(head);