diff mbox series

[RERESEND,10/11] splice: file->pipe: -EINVAL for non-regular files w/o FMODE_NOWAIT

Message ID 25974c79b84c0b3aad566ff7c33b082f90ac5f17e.1697486714.git.nabijaczleweli@nabijaczleweli.xyz (mailing list archive)
State New
Headers show
Series splice(file<>pipe) I/O on file as-if O_NONBLOCK | expand

Commit Message

Ahelenia Ziemiańska Dec. 14, 2023, 6:45 p.m. UTC
We request non-blocking I/O in the generic implementation, but some
files ‒ ttys ‒ only check O_NONBLOCK. Refuse them here, lest we
risk sleeping with the pipe locked for indeterminate lengths of
time.

This also masks inconsistent wake-ups (usually every second line)
when splicing from ttys in icanon mode.

Regular files don't /have/ a distinct O_NONBLOCK mode,
because they always behave non-blockingly, and for them FMODE_NOWAIT is
used in the purest sense of
  /* File is capable of returning -EAGAIN if I/O will block */
which is not set by the vast majority of filesystems,
and it's not the semantic we want here.

Link: https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
---
 fs/splice.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Jens Axboe Dec. 15, 2023, 3:47 p.m. UTC | #1
On 12/14/23 11:45 AM, Ahelenia Ziemia?ska wrote:
> We request non-blocking I/O in the generic implementation, but some
> files ? ttys ? only check O_NONBLOCK. Refuse them here, lest we
> risk sleeping with the pipe locked for indeterminate lengths of
> time.

A worthy goal here is ensuring that _everybody_ honors IOCB_NOWAIT,
rather than just rely on O_NONBLOCK. This does involve converting to
->read_iter/->write_iter if the driver isn't already using it, but some
of them already have that, yet don't check IOCB_NOWAIT or treat it the
same as O_NONBLOCK.

Adding special checks like this is not a good idea, imho.

> This also masks inconsistent wake-ups (usually every second line)
> when splicing from ttys in icanon mode.
> 
> Regular files don't /have/ a distinct O_NONBLOCK mode,
> because they always behave non-blockingly, and for them FMODE_NOWAIT is
> used in the purest sense of
>   /* File is capable of returning -EAGAIN if I/O will block */
> which is not set by the vast majority of filesystems,
> and it's not the semantic we want here.

The main file systems do very much set it, like btrfs, ext4, and xfs. If
you look at total_file_systems / ones_flagging_it the ratio may be high,
but in terms of installed userbase, the majority definitely will have
it. Also see comment on cover letter for addressing this IOCB_NOWAIT
confusion.
Ahelenia Ziemiańska Dec. 16, 2023, 5:36 a.m. UTC | #2
On Fri, Dec 15, 2023 at 08:47:15AM -0700, Jens Axboe wrote:
> On 12/14/23 11:45 AM, Ahelenia Ziemiańska wrote:
> > We request non-blocking I/O in the generic implementation, but some
> > files ‒ ttys ‒ only check O_NONBLOCK. Refuse them here, lest we
> > risk sleeping with the pipe locked for indeterminate lengths of
> > time.
> A worthy goal here is ensuring that _everybody_ honors IOCB_NOWAIT,
> rather than just rely on O_NONBLOCK. This does involve converting to
> ->read_iter/->write_iter if the driver isn't already using it, but some
> of them already have that, yet don't check IOCB_NOWAIT or treat it the
> same as O_NONBLOCK.
This doesn't really mean much to me, sorry.

> Adding special checks like this is not a good idea, imho.
That's what Linus said I should do so that's what I did
  https://lore.kernel.org/linux-fsdevel/CAHk-=wimmqG_wvSRtMiKPeGGDL816n65u=Mq2+H3-=uM2U6FmA@mail.gmail.com/

I can't fix the tty layer :/

> > This also masks inconsistent wake-ups (usually every second line)
> > when splicing from ttys in icanon mode.
> > 
> > Regular files don't /have/ a distinct O_NONBLOCK mode,
> > because they always behave non-blockingly, and for them FMODE_NOWAIT is
> > used in the purest sense of
> >   /* File is capable of returning -EAGAIN if I/O will block */
> > which is not set by the vast majority of filesystems,
> > and it's not the semantic we want here.
> 
> The main file systems do very much set it, like btrfs, ext4, and xfs. If
> you look at total_file_systems / ones_flagging_it the ratio may be high,
> but in terms of installed userbase, the majority definitely will have
> it. Also see comment on cover letter for addressing this IOCB_NOWAIT
> confusion.
Reassessing
[1] https://lore.kernel.org/linux-fsdevel/5osglsw36dla3mubtpsmdwdid4fsdacplyd6acx2igo4atogdg@yur3idyim3cc/
I see FMODE_NOWAIT in
  blockdevs
  /dev/{null,zero,random,urandom}
  btrfs/ext4/f2fs/ocfs2/xfs
  eventfd
  pipes
  sockets
  tun/tap
which means that vfat/fuse/nfs/tmpfs/ramfs/procfs/sysfs don't.
(zfs also doesn't, but that's not for this list.)

I don't know if that's actually a "majority" in a meaningful sense,
I agree, but I think I primarily committed to this exclusion because
tmpfs/ramfs didn't.

I s'pose ramfs can already be tagged since it already returns
-EAGAIN when I/O would block (never).

tmpfs not being spliceable is also questionable.

But this'd also mean effectively deleting
  afs_file_splice_read
  ceph_splice_read
  coda_file_splice_read
  ecryptfs_splice_read_update_atime
  fuse_dev_splice_read
  nfs_file_splice_read
  orangefs_file_splice_read
  shmem_file_splice_read
  v9fs_file_splice_read
(not to mention the many others (adfs/affs/bfs/bcachefs/cramfs/erofs/fat/hfs*/hostfs/hpfs/jffs2/jfs/minix/nilfs/ntfs/omfs/reiserfs/isofs/sysv/ubifs/udf/ufs/vboxsf/squashfs/romfs)
 which just use the filemap impl verbatim).

There's no point to restricting splice access on a per-filesystem level
(which this'd do), since to mount a malicious network filesystem you
need to be root.

A denial of service attack makes no sense if you're already root.

(Maybe except for fuse, which people typically run suid;
 that I could see potentially making sense to disable..)


I have indeed managed to confuse myself into the NOWAIT hole,
but this is actually about
"not letting unprivileged users escalate into
 hanging system daemons by writing to a pipe"
rather than
"if we ever hold the pipe lock for >2µs we die instantly".

O_NONBLOCK filtered by FMODE_NOWAIT is used as a semantic proxy for
the 10 different types of files anyone can create that we know are safe.

Anyone can open a socket and not write to it, so we must refuse to
splice from a socket with no data in it.
But only root can mount filesystems, so a regular file is always safe.

And, actually defining a slightly-heuristic per-file policy in the syscall
itself is stupid, you've talked me out of this.
This check only actually applies to the generic copy_splice_read()
implementation, since the "real"/non-generic splices
(fiemap_splice_read/per-filesystem ‒
 all the others that this patchset touches)
are already known to be safe
(and aren't reads so FMODE_NOWAIT doesn't factor in at all).

I've dropped this patch and have instead added this to 01/11:
  diff --git a/fs/splice.c b/fs/splice.c
  index f8bfc9cf8cdc..6d369d7d56d5 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -331,0 +332,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
  +	/*
  +	 * This generic implementation is only safe w.r.t. the pipe lock
  +	 * if the file actually respects IOCB_NOWAIT, which ttys don't.
  +	 */
  +	if (!(in->f_mode & FMODE_NOWAIT))
  +		return -EINVAL;

(Indeed, in many ways, Linus' post to which I reply in [1] pretty much
 says this explcitly. Actually he literally says this. I just don't
 realise and instead of adding the snippet to copy_splice_read(),
 which he already diffed and talks about, I copied it to the syscall.)

Now I just need to re-consider the prose in a way
that avoids this deeply embarrassing IOCB_NOWAIT/regular-file nonsense,
and this series ends up being just "fixing splice implementations"
without also special-casing the syscall itself.

Thanks for asking the right questions.
Sorry for longposting.
diff mbox series

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 9d29664f23ee..81788bf7daa1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1300,6 +1300,8 @@  long do_splice(struct file *in, loff_t *off_in, struct file *out,
 	} else if (opipe) {
 		if (off_out)
 			return -ESPIPE;
+		if (!((in->f_mode & FMODE_NOWAIT) || S_ISREG(in->f_inode->i_mode)))
+			return -EINVAL;
 		if (off_in) {
 			if (!(in->f_mode & FMODE_PREAD))
 				return -EINVAL;