Message ID | 20220502140602.130373-1-Jason@zx2c4.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Series | [1/2] sysctl: read() must consume poll events, not poll() |
+Lennart, since systemd is the only userspace I know of currently making
use of this.

On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
> Events that poll() responds to are supposed to be consumed when the file
> is read(), not by the poll() itself. By putting it on the poll() itself,
> it makes it impossible to poll() on a epoll file descriptor, since the
> event gets consumed too early. Jann wrote a PoC, available in the link
> below.
>
> Reported-by: Jann Horn <jannh@google.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  fs/proc/proc_sysctl.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 7d9cfc730bd4..1aa145794207 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>
>  static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
>  {
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct ctl_table_header *head = grab_header(inode);
> +	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
> +
> +	if (!IS_ERR(head) && table->poll)
> +		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
> +	sysctl_head_finish(head);
> +
>  	return proc_sys_call_handler(iocb, iter, 0);
>  }
>
> @@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
>  	event = (unsigned long)filp->private_data;
>  	poll_wait(filp, &table->poll->wait, wait);
>
> -	if (event != atomic_read(&table->poll->event)) {
> -		filp->private_data = proc_sys_poll_event(table->poll);
> +	if (event != atomic_read(&table->poll->event))
>  		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
> -	}
>
>  out:
>  	sysctl_head_finish(head);
> --
> 2.35.1

Just wanted to double check with you that this change wouldn't break how
you're using it in systemd for /proc/sys/kernel/hostname:

https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465

I couldn't find anybody else actually polling on it. Interestingly, it
looks like sd_event_add_io uses epoll() inside, but you're not hitting
the bug that Jann pointed out (because I suppose you're not poll()ing on
an epoll fd).

Jason

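The access pattern the commit message calls impossible, poll() on an epoll fd that itself watches another fd, can be sketched in userspace as follows. This is not Jann's PoC (that is only linked from the patch); the helper names `nest_in_epoll`, `outer_poll`, and `inner_wait` are hypothetical and exist only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register fd in a fresh epoll instance.
 * Returns the epoll fd, or -1 on error. */
static int nest_in_epoll(int fd, uint32_t events)
{
	struct epoll_event ev = { .events = events, .data = { .fd = fd } };
	int ep;

	assert(fd >= 0);
	ep = epoll_create1(EPOLL_CLOEXEC);
	if (ep < 0)
		return -1;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(ep);
		return -1;
	}
	return ep;
}

/* Step 1: poll() the epoll fd itself.
 * Returns 1 if it reports readable, 0 on timeout, -1 on error. */
static int outer_poll(int ep, int timeout_ms)
{
	struct pollfd pfd = { .fd = ep, .events = POLLIN };
	int r = poll(&pfd, 1, timeout_ms);

	if (r < 0)
		return -1;
	return (r > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Step 2: collect the event that step 1 announced.
 * Returns the number of ready events (0 or 1), or -1 on error. */
static int inner_wait(int ep, int timeout_ms)
{
	struct epoll_event out;

	return epoll_wait(ep, &out, 1, timeout_ms);
}
```

With an fd whose readiness is only cleared by read(), such as a pipe, step 2 finds the event that step 1 announced. As the commit message describes it, the sysctl poll handler consumes the event while answering step 1, so step 2 can come back empty on an unpatched kernel.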
On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> Just wanted to double check with you that this change wouldn't break how
> you're using it in systemd for /proc/sys/kernel/hostname:
>
> https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>
> I couldn't find anybody else actually polling on it. Interestingly, it
> looks like sd_event_add_io uses epoll() inside, but you're not hitting
> the bug that Jann pointed out (because I suppose you're not poll()ing on
> an epoll fd).

Well, if you made sure this still works, I am fine either way ;-)

Lennart

--
Lennart Poettering, Berlin

On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
> On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>
> > Just wanted to double check with you that this change wouldn't break how
> > you're using it in systemd for /proc/sys/kernel/hostname:
> >
> > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
> >
> > I couldn't find anybody else actually polling on it. Interestingly, it
> > looks like sd_event_add_io uses epoll() inside, but you're not hitting
> > the bug that Jann pointed out (because I suppose you're not poll()ing on
> > an epoll fd).
>
> Well, if you made sure this still works, I am fine either way ;-)

Actually... ugh. It doesn't work. systemd uses uname() to read the host
name, and doesn't actually read() the file descriptor after receiving
the poll event on it. So I guess I'll forget this, and maybe we'll have
to live with sysctl's poll() being broken. :(

Jason

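To make the incompatibility concrete, here is a minimal userspace sketch of the two ways a watcher could fetch the hostname after a poll event. It is not systemd's code; the function names are hypothetical, and the choice of `POLLPRI` as the requested event is an assumption based on the mask the kernel returns in the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper: open a sysctl file for change notification. */
static int open_hostname_watch(const char *path)
{
	assert(path != NULL);
	return open(path, O_RDONLY | O_CLOEXEC | O_NONBLOCK | O_NOCTTY);
}

/* Returns 1 if a change is signalled, 0 on timeout, -1 on error. */
static int wait_for_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int r = poll(&pfd, 1, timeout_ms);

	if (r < 0)
		return -1;
	return (r > 0 && (pfd.revents & (POLLPRI | POLLERR))) ? 1 : 0;
}

/* Style described in the thread for systemd: fetch the name with uname().
 * The watched fd is never read, so a kernel that only clears the event
 * in read() would keep signalling it. */
static int hostname_via_uname(char *buf, size_t len)
{
	struct utsname u;

	if (len == 0 || uname(&u) < 0)
		return -1;
	strncpy(buf, u.nodename, len - 1);
	buf[len - 1] = '\0';
	return 0;
}

/* Style the patch would require: rewind and read() the watched fd,
 * which is where the patched kernel consumes the event. */
static ssize_t hostname_via_read(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0 || lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}
```

A caller that pairs `wait_for_change()` with `hostname_via_uname()` works on current kernels but, with the patch applied, would presumably see the event reported on every subsequent poll, since nothing ever reads the fd.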
On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> >
> > > Just wanted to double check with you that this change wouldn't break how
> > > you're using it in systemd for /proc/sys/kernel/hostname:
> > >
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
> > >
> > > I couldn't find anybody else actually polling on it. Interestingly, it
> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
> > > an epoll fd).
> >
> > Well, if you made sure this still works, I am fine either way ;-)
>
> Actually... ugh. It doesn't work. systemd uses uname() to read the host
> name, and doesn't actually read() the file descriptor after receiving
> the poll event on it. So I guess I'll forget this, and maybe we'll have
> to live with sysctl's poll() being broken. :(

A kconfig option may let you do what you want, and allow older kernels
to not break, however I am more curious how sysctl's approach to poll
went unnoticed for so long. But also, I'm curious if it was based on
another poll implementation which may have been busted.

But more importantly, how do we avoid this in the future?

  Luis

On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>Events that poll() responds to are supposed to be consumed when the file
>is read(), not by the poll() itself. By putting it on the poll() itself,
>it makes it impossible to poll() on a epoll file descriptor, since the
>event gets consumed too early. Jann wrote a PoC, available in the link
>below.
>
>Reported-by: Jann Horn <jannh@google.com>
>Cc: Kees Cook <keescook@chromium.org>
>Cc: Luis Chamberlain <mcgrof@kernel.org>
>Cc: linux-fsdevel@vger.kernel.org
>Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

It seems to be my bug. This is indeed better. Also, I don't think it's
unsafe to fix it like this either. If my memory serves (it's what, 10+
years?), this was only tested and used with poll(), which will continue
to work.

There were plans to use it in one of systemd's tools, in which case we'd
probably notice the misbehavior with epoll().... hmm, checking now
systemd's codebase:

    static int on_hostname_change(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
            ...
            log_info("System hostname changed to '%s'.", full_hostname);
            ...
    }

    static int manager_watch_hostname(Manager *m) {
            int r;

            assert(m);

            m->hostname_fd = open("/proc/sys/kernel/hostname",
                                  O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY);
            if (m->hostname_fd < 0) {
                    log_warning_errno(errno, "Failed to watch hostname: %m");
                    return 0;
            }

            r = sd_event_add_io(m->event, &m->hostname_event_source, m->hostname_fd, 0, on_hostname_change, m);
            if (r < 0) {
                    if (r == -EPERM)
                            /* kernels prior to 3.2 don't support polling this file. Ignore the failure. */
                            m->hostname_fd = safe_close(m->hostname_fd);
                    else
                            return log_error_errno(r, "Failed to add hostname event source: %m");
            }

            ....
    }

and sd_event library uses epoll. So, it's apparently not working and it
doesn't seem to be their intention to rely on the misbehavior.

This makes me think it even deserves a Cc to stable.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>

Lucas De Marchi

>---
> fs/proc/proc_sysctl.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index 7d9cfc730bd4..1aa145794207 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>
> static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
> {
>+	struct inode *inode = file_inode(iocb->ki_filp);
>+	struct ctl_table_header *head = grab_header(inode);
>+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
>+
>+	if (!IS_ERR(head) && table->poll)
>+		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
>+	sysctl_head_finish(head);
>+
> 	return proc_sys_call_handler(iocb, iter, 0);
> }
>
>@@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
> 	event = (unsigned long)filp->private_data;
> 	poll_wait(filp, &table->poll->wait, wait);
>
>-	if (event != atomic_read(&table->poll->event)) {
>-		filp->private_data = proc_sys_poll_event(table->poll);
>+	if (event != atomic_read(&table->poll->event))
> 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
>-	}
>
> out:
> 	sysctl_head_finish(head);
>--
>2.35.1
>

Hi Lucas,

On 5/12/22, Lucas De Marchi <lucas.demarchi@intel.com> wrote:
> On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>>Events that poll() responds to are supposed to be consumed when the file
>>is read(), not by the poll() itself. By putting it on the poll() itself,
>>it makes it impossible to poll() on a epoll file descriptor, since the
>>event gets consumed too early. Jann wrote a PoC, available in the link
>>below.
>>
>>Reported-by: Jann Horn <jannh@google.com>
>>Cc: Kees Cook <keescook@chromium.org>
>>Cc: Luis Chamberlain <mcgrof@kernel.org>
>>Cc: linux-fsdevel@vger.kernel.org
>>Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>
> It seems to be my bug. This is indeed better. Also, I don't think it's
> unsafe to fix it like this either. If my memory serves (it's what, 10+
> years?), this was only tested and used with poll(), which will continue
> to work.

You are not correct. Please read the entire thread. This breaks systemd.

Jason

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>> >
>> > > Just wanted to double check with you that this change wouldn't break how
>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>> > >
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>> > >
>> > > I couldn't find anybody else actually polling on it. Interestingly, it
>> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
>> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
>> > > an epoll fd).
>> >
>> > Well, if you made sure this still works, I am fine either way ;-)
>>
>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>> name, and doesn't actually read() the file descriptor after receiving
>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>> to live with sysctl's poll() being broken. :(

We should be able to modify calling uname() to act the same as reading
the file descriptor.

> A kconfig option may let you do what you want, and allow older kernels
> to not break, however I am more curious how sysctl's approach to poll
> went unnoticed for so long. But also, I'm curious if it was based on
> another poll implementation which may have been busted.
>
> But more importantly, how do we avoid this in the future?

Poll on files is weird and generally doesn't work (because files are
always ready to read or write). What did we do to make it work on
these sysctl files?

Eric

Hi Eric,

On 5/12/22, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
>
>> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>>> >
>>> > > Just wanted to double check with you that this change wouldn't break how
>>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>>> > >
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>>> > >
>>> > > I couldn't find anybody else actually polling on it. Interestingly, it
>>> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
>>> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
>>> > > an epoll fd).
>>> >
>>> > Well, if you made sure this still works, I am fine either way ;-)
>>>
>>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>>> name, and doesn't actually read() the file descriptor after receiving
>>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>>> to live with sysctl's poll() being broken. :(
>
> We should be able to modify calling uname() to act the same as reading
> the file descriptor.

How? That sounds like madness. read() takes a fd. uname() doesn't. Are
you proposing we walk through the fds of the process calling uname()
til we find a matching one and then twiddle its private context state?
I mean I guess that'd work, but...

Jason

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..1aa145794207 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 
 static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
 {
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct ctl_table_header *head = grab_header(inode);
+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+
+	if (!IS_ERR(head) && table->poll)
+		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
+	sysctl_head_finish(head);
+
 	return proc_sys_call_handler(iocb, iter, 0);
 }
 
@@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
 	event = (unsigned long)filp->private_data;
 	poll_wait(filp, &table->poll->wait, wait);
 
-	if (event != atomic_read(&table->poll->event)) {
-		filp->private_data = proc_sys_poll_event(table->poll);
+	if (event != atomic_read(&table->poll->event))
 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
-	}
 
 out:
 	sysctl_head_finish(head);

Events that poll() responds to are supposed to be consumed when the file
is read(), not by the poll() itself. By putting it on the poll() itself,
it makes it impossible to poll() on a epoll file descriptor, since the
event gets consumed too early. Jann wrote a PoC, available in the link
below.

Reported-by: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/proc_sysctl.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)